ABSTRACT



The Naval Postgraduate School's Student Opinion Form 
data were studied through the use of two cluster 
analysis techniques: (1) the K-MEANS partitioning method and 
(2) Chernoff's FACES. Much developmental work was performed 
to tailor these methods to the special requirements of the 
data set. A thorough multivariate statistical review pro- 
vided the basis for choosing optimality criteria and distance 
functions for use in the MIKCA (Multivariate Iterative K-MEANS 
Clustering Algorithm). Alterations were made to the computer 
code to allow the analysis to include the effect of class 
size on cluster membership. Use of the linear discriminant 
function aided in identifying variables for use in constructing 
features of the computer-drawn faces. This approach to the 
Chernoff's FACES technique shows promise but needs further 
development. A principal components analysis of the data 
showed it to be essentially one dimensional. Partitioning 
the data into four clusters shows that the scoring of the 
courses varies inversely with class size. 

I. INTRODUCTION 



The Student Opinion Form (SOF) used at the Naval Post- 
graduate School provides an organized information gathering 
mechanism about each course (and its instructor) as per- 
ceived by the students. The information obtained from the 
SOF data is used for administrative review of faculty 
performance and for feedback to the instructor to aid in 
self-development. The former use is hampered by the fact 
that the data are multivariate in nature and represent a 
complicated set of interactions between the instructor's 
performance, the nature of the course, and the group of 
students. There is need for methodology which can disen- 
tangle those interactions and summarize the data in a 
meaningful way. 

It is the purpose of this thesis to develop suitable 
cluster analysis methods for studying the data and dis- 
covering any hidden structure they may possess. Concurrently, 
a certain amount of exploratory data analysis took place, 
and those results are reported also. 

At the completion of every quarter, students are requested 
to respond to a 16-item SOF questionnaire for each course 
in which they are enrolled. The data are viewed as an n by p 
matrix, representing n observations (SOFs), each of which is 
measured on p (16) different variables. For this research 
the mean vector of each course was computed. Then attempts 
were made to discover natural clusters of these mean vectors 
which in turn can be interpreted as the underlying structure 
in the data. Since the number of students per course is 
quite variable, the mean vectors are not equally well 
determined. Also, the matrix of mean vectors may have a 
covariance structure quite different from that of the full 
n by p data matrix. 

The clustering objective was pursued by two multivariate 
statistical methods: a computer-graphic technique referred 
to as Chernoff's FACES, and a more mathematically 
oriented approach called K-MEANS. The former produces computer- 
drawn cartoon faces, the features of which are controlled 
by variables in the data. The assignment of variables to 
features was aided by the use of linear discriminant analysis. 
One face is produced from each course mean vector, and then 
the researcher is able to study the appearance of the faces 
and cluster together those that display similar character- 
istics. The second method utilizes a computer program 
called MIKCA (Multivariate Iterative K-MEANS Clustering 
Algorithm) which is based on the K-MEANS method. It forms 
an initial partition of the data and then transfers obser- 
vations between clusters in order to improve an optimality 
criterion function. In this iterative manner, MIKCA ultimately 
stabilizes and provides an "optimal" cluster solution. 

In addition, a modified MIKCA technique was employed. 
Alterations were made to the basic computer code to enable 
the program to weight each mean vector by the number of 
students in the course. This modification may be likened 
to a one-way Analysis of Variance (ANOVA) with unequal 
numbers of observations per treatment. The result is 
to stabilize the relative variability of the various course 
mean vectors. 

Most multivariate analysis methodology is derived assuming 
the data have a multivariate normal distribution with common 
covariance matrix. The performance of the MIKCA program 
and the linear discriminant analysis will not depend greatly 
upon this assumption provided the clusters are well defined. 
On the other hand, if the clusters are not well separated, 
the results of the programs will be sensitive to these 
assumptions, and this is the condition anticipated. Accordingly, 
a transformation was sought to bring the data closer to 
satisfying them. The one selected is essentially a logistic 
function. 

It is frequently necessary to compare the agreement of 
cluster solutions produced under different conditions or by 
different methods. For this purpose, a computer program 
was written which provides an ad hoc measure of the amount 
of agreement between the results of two or more solutions. 
A number between zero and one, called the comparison 
coefficient, is the resulting measure of association. 

This thesis was largely exploratory and should serve 
as a firm foundation for future study of the SOF data in 
particular and similarly structured multivariate data sets 
in general. A number of unexpected questions were raised 
during the exploratory phases of this research. It was 
not possible to answer many of these questions, and their 
consideration is left to other researchers. During the 
development of the methodologies, some new and challenging 
problems were encountered. Many of these had to be given 
rather short treatment in the interest of meeting the original 
objectives. It should be emphasized that although some very 
interesting facts are revealed in this thesis, the results 
are by no means considered to describe completely the 
information hidden in the data. 

II. CLUSTER ANALYSIS 



A. ORIGIN AND THEORY 

Cluster analysis is the name given to a body of diverse 
techniques for discovering taxonomical structure within bodies 
of data. It is one of several methodologies included in 
the broader category called classification. In cluster 
analysis little or nothing is known about the category 
structure. All that is available is a collection of obser- 
vations whose category memberships are unknown, and one must 
discover a category structure which fits the observations. 

The objective is to find the natural groups by sorting the 
observations such that the association is high among members 
of the same group and low between members of different groups. 
The great challenge to the researcher is finding the most 
appropriate way of defining "natural groups" and "association." 
Cluster analysis is closely related to and often confused 
with discriminant analysis, a statistical procedure for 
assigning new observations to known groups. In contrast 
to discriminant analysis, clustering refers to discovery of 
the initial groups. 

Although modern clustering techniques began development 
in biological taxonomy, they are generally applicable to 
all types of data. Any method which partitions a set of 
objects into subsets on the basis of measurements taken on 
every object qualifies as a clustering method. Cluster 
analysis techniques are most often applied in multivariate 
settings, that is, where each of n observations is measured 
on p different variables. A clear intuitive picture of the 
concept is helpful in appreciating the value of cluster 
analysis and the situations to which it might be applied. 

In a geometric sense, every object (observation) may be 
viewed as a point in p-dimensional Euclidean space. This 
swarm of data points may contain dense regions or "clouds" 
of data points which are separable from other regions 
containing a low density of points. These denser regions 
constitute what are known as clusters. In the one and two 
dimensional cases, it is easy for the human eye to quickly 
detect the clusters from scatter plots, assuming that the 
clusters exist. In higher dimensions, clustering attempts 
become extremely difficult without the aid of computers. 

Solutions to the clustering problem usually involve the 
determination of a partition which satisfies some optimality 
criterion. The optimality criterion is a way of measuring 
how good a particular cluster solution is relative to other 
solutions. An astounding number of possible solutions exist. 
Reference 1 describes a Stirling number of the second kind 
representing the number of ways n objects may be sorted 
into m groups: 

$$ S_n^{(m)} = \frac{1}{m!} \sum_{k=0}^{m} (-1)^{m-k} \binom{m}{k} k^n $$

The number of groups is usually unknown so the problem is 
compounded, and the total number of possibilities is a sum 
of Stirling numbers. In the case of 25 observations, the 
total number of possible cluster solutions is 

$$ \sum_{m=1}^{25} S_{25}^{(m)} $$

which exceeds $4 \times 10^{18}$. This illustrates that the enumerative 

technique for finding solutions can require huge amounts of 
computer time, and there exists a need for a better way. 
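
The size of the enumeration problem can be checked directly. The following Python sketch evaluates the Stirling number formula and sums it over every possible number of groups; the function names `stirling2` and `total_partitions` are illustrative choices and do not appear in the thesis or in Reference 1.

```python
from math import comb, factorial


def stirling2(n: int, m: int) -> int:
    """Number of ways to sort n objects into m non-empty groups."""
    alternating_sum = sum(
        (-1) ** (m - k) * comb(m, k) * k ** n for k in range(m + 1)
    )
    # The alternating sum is always an exact multiple of m!.
    return alternating_sum // factorial(m)


def total_partitions(n: int) -> int:
    """Total number of cluster solutions when the group count is unknown."""
    return sum(stirling2(n, m) for m in range(1, n + 1))


if __name__ == "__main__":
    print(stirling2(4, 2))        # ways to split 4 objects into 2 groups
    print(total_partitions(25))   # all possible clusterings of 25 objects
```

Python integers have arbitrary precision, so the count for 25 observations is computed exactly rather than approximated.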

Modern techniques allow solutions to be found without 
evaluating the criterion for each and every solution. How- 
ever the need for ranking solutions is evident, and the 
criterion function serves to meet this need. A wide variety 
of such functions exists, and the choice is usually determined 
by the particular characteristics of the research being con- 
ducted. A more detailed discussion of optimality criteria 
is presented in Section II. C. 

Mathematical clustering techniques usually call for a 
concept of distance between objects. In order to solve the 
cluster problem, it is desirable to define the terms "similarity" 
and "difference" in a quantitative fashion. What 
does it mean to say two objects are different? Perhaps an 
investigator would assign two observations to the same group 
if the distance between them is sufficiently small, or to 
different clusters if this distance is sufficiently large. 
Common reference to the closeness of objects is made in 
units of length, weight, or time. Numerous methods for 
measuring distance will be discussed in Section II. D. 



In the following discussion, $X_i$ and $X_j$ represent two 
points in p-dimensional Euclidean space ($E_p$) corresponding 
to objects or observations. Any non-negative real-valued 
function $D(X_i, X_j)$ satisfying the following conditions 
qualifies as a distance function (or metric). 

a. $D(X_i, X_j) \ge 0$ for all $X_i$ and $X_j$ in $E_p$ 

b. $D(X_i, X_j) = 0$ if and only if $X_i = X_j$ 

c. $D(X_i, X_j) = D(X_j, X_i)$ 

d. $D(X_i, X_j) \le D(X_i, X_k) + D(X_k, X_j)$ 

where $X_i$, $X_j$, and $X_k$ are any three points in $E_p$. Later 
discussions will place particular emphasis on the Mahalanobis 
metric. 

The use of cluster analysis is applicable in nearly 
every field of study. The literature is both voluminous and 
diverse, the terminology differing from one field to another. 
"Numerical taxonomy" is frequently substituted for cluster 
analysis among biologists, botanists, and ecologists, while 
some social scientists may prefer "typology." Other fre- 
quently encountered terms are pattern recognition and par- 
titioning. While discriminant analysis has been studied by 
statisticians for nearly 45 years, cluster analysis has only 
recently come to statistical notice. 

Cluster analysis is an exploratory device, a tool for 
suggestion and discovery. A question often asked is "How do 
you know when you have a good set of clusters?" The answer 
is that the clusters themselves are not interesting; the 
point of interest is in inference about the structure of 
the data. The clusters do not explain the structure; they 
are consequences of the structure. The explanatory struc- 
ture is the object of the search and its description is 
in terms of principles and ideas, not individual data units. 

It is important to realize that a given set of data 
may contain no "right" classification, but possibly many 
different, meaningful classifications. It could be the 
case that the data contain no clusters at all. 

B. SCATTER MATRIX DECOMPOSITION 

Described in this section are the multivariate terminology 
and notation to be used in this thesis. The literature 
contains as many different notational structures as authors. 
The emphasis is on simplicity, while also exposing the reader 
to some of the more common terminology. 

In general, multivariate data are viewed as an n by p 
matrix referred to as X. It represents n observations, each 
of which consists of measurements on p different variables. 

The cross products matrix is analogous to the univariate sum 
of squared deviations from the mean and is represented by 
the p by p matrix T. 

$$ T = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..})(x_{ij} - \bar{x}_{..})' $$

where 

$\bar{x}_{..}$ is the grand mean vector of the data. 

g is the number of groups. 

$n_i$ is the number of observations in the i-th group. 

Prime notation indicates transpose. 



All vectors are column vectors. Cross product matrices 
are also referred to as scatter matrices. Division of T 
by n-1 (where n represents the total number of observations) 
yields the total variance-covariance matrix, sometimes referred 
to as a dispersion matrix. 

The total sum of squares (cross products) matrix may 
be expressed as the sum of the within-group and the between- 
group scatter matrices: 



$$ T = W + B $$

W and B are defined as follows: 

$$ W = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})(x_{ij} - \bar{x}_{i.})' $$

$$ B = \sum_{i=1}^{g} n_i (\bar{x}_{i.} - \bar{x}_{..})(\bar{x}_{i.} - \bar{x}_{..})' $$

where $\bar{x}_{i.}$ is the mean vector of the i-th group. Each 
individual group has its own scatter matrix $W_i$ and W is 
the sum of these matrices: 

$$ W = \sum_{i=1}^{g} W_i $$

This discussion is intended to be completely general, 
with no particular group structure in mind. Later we 
shall explore the differences in the two group structures 
represented by the SOF data. 

(1) Individual SOFs are considered to be the obser- 
vations and the courses are the groups. 

(2) The course mean vectors are the observations and 
the clusters of courses are the groups. 

These two group structures are different ways of viewing 
the data; their relationship shall be explained in Section 
IV. B. 

C. OPTIMALITY CRITERIA 

Most of the well known clustering techniques fall into 
one of two main categories: (1) hierarchical and (2) 
partitioning. The former class is one in which every cluster 
obtained at any stage is a merger of clusters at previous 
stages. The non-hierarchical procedures, however, form new 
clusters by lumping and splitting old ones. 

Partitioning methods were used in this research. The 
main idea is to choose some initial partition and then alter 
the cluster membership in an effort to improve the partition. 
Different interpretations of what constitutes a "better" 
partition and numerous ways of achieving this improvement 
have led to a great variety of algorithms. These methods 
are related to the steepest descent algorithms used for 
unconstrained optimization in nonlinear programming. Such 
algorithms begin with an initial point and then converge 
to a local optimum, moving one step at a time, the value of 
the objective function improving at each step. A well known 
example is the ISODATA procedure developed by Ball and Hall 
at Stanford Research Institute. Chapter IV discusses a 
partitioning method known as K-MEANS which was developed 
by MacQueen [2]. He uses the term "K-MEANS" to denote the 
process of assigning each data unit to that cluster (of 
k clusters) with the nearest centroid (mean vector). The 
cluster centroids change with each transfer of an observation. 
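
The nearest-centroid idea can be illustrated with a short Python sketch. It is a simplified batch variant that recomputes the centroids after each full pass through the data, rather than after every single transfer as in MacQueen's description, and it is not the MIKCA program. Euclidean distance, the starting centroids, and all function names are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def k_means(data, initial_centroids, max_passes=100):
    """Assign each data unit to its nearest centroid until nothing moves."""
    centroids = [list(c) for c in initial_centroids]  # do not mutate the input
    labels = [None] * len(data)
    for _ in range(max_passes):
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(x, centroids[c]))
            for x in data
        ]
        if new_labels == labels:      # no transfers took place: stable
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:               # an empty cluster keeps its old centroid
                centroids[c] = mean_vector(members)
    return labels, centroids


if __name__ == "__main__":
    data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    print(k_means(data, [[0.0, 0.0], [10.0, 10.0]]))
```

Like the steepest descent methods mentioned above, this procedure stops at a local optimum, so the result depends on the initial partition.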

The decomposition of the total scatter into within 
and between components suggests possible optimality criteria 
to be used in a clustering algorithm. One would like the 
within-groups scatter to be small relative to the between- 
groups scatter. Various trial clusterings could be formed 
using the W and B matrices as a basis for the optimality 
criteria which determine the best clustering. A possible 
choice for a criterion is to minimize trace W over all 
partitions into g groups. Since T is constant over all 
partitions, minimizing trace W is equivalent to maximizing 
trace B since 

trace T = trace W + trace B 

Although trace W is invariant under an orthogonal 
transformation, it is not invariant under other non-singular linear 
transformations. 

McRae [3] points out that trace W equals the total within 
group sum of squares, hence the "minimum variance partition" 
cluster solution is found by minimizing trace W. 
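
A small numerical check makes the decomposition concrete. The sketch below computes T, W, and B for an arbitrary grouping and confirms that T = W + B, and therefore that trace T = trace W + trace B. The data values and function names are hypothetical, and B is weighted by group size, which is the form required for the identity to hold.

```python
def mean_vector(rows):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def scatter(rows, center):
    """Sum of outer products of deviations from center (p by p matrix)."""
    p = len(center)
    result = [[0.0] * p for _ in range(p)]
    for x in rows:
        d = [xi - ci for xi, ci in zip(x, center)]
        for r in range(p):
            for c in range(p):
                result[r][c] += d[r] * d[c]
    return result


def decompose(groups):
    """Return the total, within-group, and between-group scatter matrices."""
    all_rows = [x for group in groups for x in group]
    grand_mean = mean_vector(all_rows)
    p = len(grand_mean)
    total = scatter(all_rows, grand_mean)
    within = [[0.0] * p for _ in range(p)]
    between = [[0.0] * p for _ in range(p)]
    for group in groups:
        group_mean = mean_vector(group)
        group_scatter = scatter(group, group_mean)
        d = [m - g for m, g in zip(group_mean, grand_mean)]
        for r in range(p):
            for c in range(p):
                within[r][c] += group_scatter[r][c]
                between[r][c] += len(group) * d[r] * d[c]
    return total, within, between


def trace(matrix):
    return sum(matrix[i][i] for i in range(len(matrix)))


if __name__ == "__main__":
    groups = [[[1.0, 2.0], [3.0, 4.0]],
              [[5.0, 9.0], [7.0, 7.0], [9.0, 8.0]]]
    T, W, B = decompose(groups)
    print(trace(T), trace(W) + trace(B))
```

Because trace T is fixed by the data, any regrouping that lowers trace W raises trace B by the same amount.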

Considerable study has been devoted to alternative 
criteria such as those based on multivariate statistical 
analysis techniques, especially the methods of linear 
discriminant analysis and multivariate analysis of variance. 
Assuming the p variables are not linearly dependent, then 
as long as $p \le n-g$, W is positive definite symmetric and 
so is $W^{-1}$. Attempts to make B and W as different as possible 
lead one to solving the determinantal equation: 

$$ |B - \lambda W| = 0 $$

The solutions are the eigenvalues of the matrix $W^{-1}B$. 

There are t non-zero eigenvalues, where t is the minimum 
of p and g-1. This is a consequence of the fact that, if 
g is less than p, the g group means are contained in a 
(g-1) -dimensional hyperplane. When g = 2 the analysis is 
equivalent to two-group discriminant analysis. Linear 
discriminant analysis would take the vectors originally 
described in a p-dimensional coordinate system and transform 
the basis to a t-dimensional system. Maximizing the 
largest of these eigenvalues is a criterion suggested by 
S.N. Roy. Maximizing the trace of $W^{-1}B$, however, is a 
criterion known as Hotelling's trace criterion. In both 
cases, large values for these statistics are sought in 
clustering algorithms since large values indicate large 
differences among (between) groups. Minimizing the ratio 
of determinants $|W| / |T|$ is a criterion widely known as 
Wilks' lambda. Since T is the same for all partitions, 
this criterion is equivalent to minimizing det W. 
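
The three criteria can be compared on a small example. The sketch below is restricted to p = 2 so that the eigenvalues of the product of the inverse of W with B follow from the quadratic formula. The matrices W and B are hypothetical, with B chosen to have rank one as in the two-group case, and the function names are illustrative.

```python
from math import sqrt


def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]


def inv2(A):
    """Inverse of a 2 by 2 matrix; assumes a non-zero determinant."""
    d = det2(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]


def matmul2(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]


def criteria(W, B):
    """Roy's largest root, Hotelling's trace, and Wilks' lambda for p = 2."""
    w_inv_b = matmul2(inv2(W), B)
    tr = w_inv_b[0][0] + w_inv_b[1][1]
    dt = det2(w_inv_b)
    root = sqrt(max(tr * tr - 4.0 * dt, 0.0))   # guard against rounding
    eigenvalues = ((tr + root) / 2.0, (tr - root) / 2.0)
    T = [[W[r][c] + B[r][c] for c in range(2)] for r in range(2)]
    return {
        "eigenvalues": eigenvalues,
        "roy_largest_root": eigenvalues[0],
        "hotelling_trace": tr,
        "wilks_lambda": det2(W) / det2(T),
    }


if __name__ == "__main__":
    W = [[4.0, 1.0], [1.0, 3.0]]   # hypothetical within-group scatter
    B = [[2.0, 1.0], [1.0, 0.5]]   # hypothetical rank-one between-group scatter
    for name, value in criteria(W, B).items():
        print(name, value)
```

With a rank-one B only one eigenvalue is non-zero, so Roy's criterion and Hotelling's trace coincide in this example.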

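Both trace $W^{-1}B$ and $|T| / |W|$ may be expressed in terms 
of the eigenvalues $\lambda_i$ of $W^{-1}B$. 

$$ \frac{|T|}{|W|} = \prod_{i=1}^{t} (1 + \lambda_i) $$

$$ \operatorname{trace} W^{-1}B = \sum_{i=1}^{t} \lambda_i $$

where t = min(p,g-1). Therefore minimizing det W is 
equivalent to maximizing $\prod (1 + \lambda_i)$. 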
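
The product expression follows in a few steps from the decomposition T = W + B, assuming W is non-singular as stated earlier. The derivation below is a sketch added for completeness and uses only the symbols already defined in this section.

```latex
\begin{aligned}
|T| &= |W + B| \\
    &= |W|\,\bigl|I + W^{-1}B\bigr| \\
    &= |W| \prod_{i=1}^{p} (1 + \lambda_i) \\
    &= |W| \prod_{i=1}^{t} (1 + \lambda_i)
\end{aligned}
```

The last step holds because the remaining p - t eigenvalues are zero and each contributes a factor of one. Since |T| is the same for every partition, a small det W corresponds to a large product.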

Friedman and Rubin [4] describe the advantages of the 
various criteria. Those based on multivariate statistical 
considerations (all but trace W) are invariant under changes 
in scale for the variables (non-singular linear transformation). 
In fact, they are the only invariants for W and B under such 
transformations. In addition, the multivariate criteria 
may take into account covariation among the variables. 

D. DISTANCE CONSIDERATIONS 

As indicated earlier there exist a number of choices 
for measuring distance between objects. The choice of 
distance function is no less important than the choice of 
variables to be used in the study. A serious difficulty 
lies in the fact that knowledge of the clusters changes the 
choice of distance functions. In the computation of the 
distance, a variable which distinguishes well between two 
established clusters might be weighted more heavily than 
others. Friedman and Rubin describe this difficulty as 
the "bootstrap" nature of the problem. Knowledge of the 
clusters would suggest an appropriate distance function 
which in turn would allow one to determine the original 
clusters. The trace W criterion implies ordinary Euclidean 
distance and thus hides this circularity. Use of the cri- 
teria which are invariant under non-singular linear trans- 
formations deals effectively with this circularity. 

The familiar Euclidean distance is illustrated in 
figure 1a. When p = 2 the geometric interpretation of this 
measure amounts to determining distances by circles. Two 
points such as A and B on the same circle are considered 
equidistant from the origin, while other points such as 
C and D are further from the origin than A and B. 

A general class of squared distance functions is provided 
by utilizing positive definite quadratic forms. Specifically, 
if $\beta$ represents a p-dimensional observation to be assigned 
to one of g groups, then to measure the squared distance 
between $\beta$ and the centroid of the i-th group one may 
consider the function 

$$ D_i = (\beta - \bar{x}_{i.})' M (\beta - \bar{x}_{i.}) \qquad (1) $$

where M is a positive definite matrix to ensure that 
$D_i \ge 0$. Different metrics are represented by different 
choices of the matrix M. When M = I (the identity matrix) 
the resulting metric is the standard Euclidean distance. 

The variance within the data may make the unweighted 
Euclidean metric inappropriate. Referring to figure 1b 
where x has a larger variance than y, one may wish to weight 
a deviation in the x direction less than an equal deviation 
in the y direction. A method for accomplishing this is 
through use of an elliptical (weighted Euclidean) distance 
function which makes points A and B equidistant from the 
origin. The matrix M in this case is diagonal with diagonal 
elements equal to the reciprocals of the variances of the 
different variables. Insofar as the variance represents the 
true structure in the data, this distance function will 
adjust for differences due to the scale of measurement of 
each of the variables. Extending this idea further, one 
may consider the covariance among variables as well. Figure 
1c shows how the axes may be tilted so that the major axis 
is oriented in a direction reflecting the positive 
correlation between x and y. Again, points on the same 
ellipse are considered equidistant from the origin. The 
matrix M in this case is the inverse of the covariance 
matrix. 
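
The three choices of M described in this section can be compared on a two-variable example. The sketch below evaluates the quadratic form for M equal to the identity, to a diagonal matrix of reciprocal variances, and to the inverse of a covariance matrix. The covariance values, the two test points, and the function names are hypothetical.

```python
def quadratic_form_distance(x, center, M):
    """Squared distance (x - center)' M (x - center) for a p by p matrix M."""
    d = [a - b for a, b in zip(x, center)]
    p = len(d)
    return sum(d[r] * M[r][c] * d[c] for r in range(p) for c in range(p))


def inverse_2x2(C):
    """Inverse of a 2 by 2 matrix; assumes a non-zero determinant."""
    (a, b), (c, d) = C
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


origin = [0.0, 0.0]
point_a = [2.0, 0.0]   # two units along the high-variance x axis
point_b = [0.0, 1.0]   # one unit along the low-variance y axis

identity = [[1.0, 0.0], [0.0, 1.0]]
covariance = [[4.0, 1.5], [1.5, 1.0]]                     # var(x) > var(y)
reciprocal_variances = [[1.0 / 4.0, 0.0], [0.0, 1.0 / 1.0]]
inverse_covariance = inverse_2x2(covariance)

if __name__ == "__main__":
    for name, M in [("Euclidean", identity),
                    ("weighted Euclidean", reciprocal_variances),
                    ("inverse covariance", inverse_covariance)]:
        print(name,
              quadratic_form_distance(point_a, origin, M),
              quadratic_form_distance(point_b, origin, M))
```

Under the identity matrix the first point is farther from the origin than the second, while the reciprocal-variance weighting places the two points at the same distance, which is the behavior described for the elliptical distance function.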

Further examination of this concept is an important 
consideration in this research. If $C_i$ represents the 
covariance matrix of the i-th cluster then the distance 
function 

$$ D_i = (\beta - \bar{x}_{i.})' C_i^{-1} (\beta - \bar{x}_{i.}) $$

uses the appropriate covariance structure when determining 
distance to a particular cluster centroid. Note that the 
number of observations in every cluster must exceed the 
dimensionality p in order to preserve the nonsingularity 
of $C_i$. Since $C_i$ changes to reflect the dispersion internal 
to each particular cluster, the use of this metric exploits 
differences in the dispersion characteristics of the 
different groups. Figure 1d illustrates the idea. Note how 
a new observation (denoted by u) is closer to the centroid 
of group one (G1) in terms of Euclidean distance but is 
more likely to be assigned to group two (G2) when using the 
$C_i^{-1}$ matrix. It is instructive to point out here that if one 
were looking for boundaries dividing the p-dimensional 
space into regions, one for each of the g groups, such 
boundaries would be non-linear. In the performance of 
discriminant analysis, Eisenbeis [5] suggests appropriate 
quadratic classification rules. 

Another choice for the M matrix in equation 1 is 
$C^{-1}$ where C represents the pooled within groups covariance 
matrix of all the clusters. 

$$ C = \frac{W}{\sum_{i=1}^{g} (n_i - 1)} $$

Recall from Section II. B: 

$$ W = \sum_{k=1}^{g} W_k $$

This distance is the well known Mahalanobis distance. 
Note that C does not change from group to group. To ensure 
the non-singularity of C it must be true that $p \le (n-g)$ 
where 

$$ n = \sum_{i=1}^{g} n_i $$

n represents the total number of observations over all 
groups. 

The use of the Mahalanobis metric in the original p- 
dimensional space is equivalent to using Euclidean distance 
in the t-dimensional discriminant space with basis vectors 
corresponding to the eigenvectors of $W^{-1}B$. Note that the 
determination of the discriminant space was based on the 
assumption of homogeneity of the cluster covariance struc- 
ture. The Mahalanobis distance function therefore adjusts 
for both scale of measurement of the variables and covariation 
among the variables. Use of this metric is equivalent 
to computing Euclidean distances on variables transformed to 
their principal components and scaled to unit variance. 

The natural metric to use with the trace W criterion is 
the Euclidean distance. However, when using criteria based 
on multivariate statistical considerations, Mahalanobis is 
the natural metric to use. 

When the clusters are distributed as p-variate normal 
and have equal covariance matrices, then Fisher's linear 
discriminant function is applicable, as is the Mahalanobis 
distance. The accuracy of the Mahalanobis metric is sensitive 
to the homogeneity of the cluster dispersions and 
decreases as the difference between the group dispersions 
increases. Recall the density function for the multivariate 
normal distribution 



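$$ f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} (\det \Sigma)^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu) \right\} $$

where $\Sigma$ is the covariance matrix and $\mu$ is the mean vector 
of the distribution. Note that the exponent is a Mahalanobis 
distance, which implies that utilization of Mahalanobis 
distance is equivalent to 
measurement of the density at the point x. The empirical 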
distributions of the clusters will therefore determine the 
cluster to which the observation should be assigned. The 
following is a proof of the invariance of Mahalanobis 
distance under any non-singular linear transformation. 
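Consider the transformation 

$$ Y = BX $$

where B is a non-singular matrix, and let $D(Y_i, Y_j)$ represent 
Mahalanobis distance between $Y_i$ and $Y_j$. Writing $C_X$ and 
$C_Y = B C_X B'$ for the covariance matrices of X and Y, 

$$
\begin{aligned}
D(Y_i, Y_j) &= (Y_i - Y_j)' C_Y^{-1} (Y_i - Y_j) \\
            &= (BX_i - BX_j)' C_Y^{-1} (BX_i - BX_j) \\
            &= (X_i - X_j)' B' C_Y^{-1} B (X_i - X_j) \\
            &= (X_i - X_j)' B' (B C_X B')^{-1} B (X_i - X_j) \\
            &= (X_i - X_j)' C_X^{-1} (X_i - X_j) \\
            &= D(X_i, X_j)
\end{aligned}
$$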
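
The invariance can also be confirmed numerically. The sketch below is limited to two variables so that the matrix inverse can be written out by hand. The covariance matrix, the transformation matrix, and the two points are hypothetical values chosen only for illustration.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def inverse_2x2(C):
    """Inverse of a 2 by 2 matrix; assumes a non-zero determinant."""
    (a, b), (c, d) = C
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis_sq(u, v, C):
    """Squared Mahalanobis distance (u - v)' C^{-1} (u - v)."""
    d = [[a - b] for a, b in zip(u, v)]           # column vector u - v
    return matmul(matmul(transpose(d), inverse_2x2(C)), d)[0][0]


def check_invariance():
    """Return the distance before and after a non-singular transformation."""
    C_x = [[4.0, 1.5], [1.5, 1.0]]     # hypothetical covariance of X
    B = [[2.0, 1.0], [0.5, 3.0]]       # hypothetical matrix, determinant 5.5
    x_i, x_j = [1.0, 2.0], [3.0, -1.0]
    y_i = [row[0] for row in matmul(B, [[v] for v in x_i])]
    y_j = [row[0] for row in matmul(B, [[v] for v in x_j])]
    C_y = matmul(matmul(B, C_x), transpose(B))    # covariance of Y = BX
    return mahalanobis_sq(x_i, x_j, C_x), mahalanobis_sq(y_i, y_j, C_y)


if __name__ == "__main__":
    print(check_invariance())
```

The two returned values agree up to floating point rounding, whereas the Euclidean distance between the same pair of points changes under the transformation.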



Some other common metrics are defined below. 

1. $L_1$ norm (city block) 

$$ D(X_i, X_j) = \sum_{k=1}^{p} |x_{ki} - x_{kj}| $$

2. $L_p$ norm (Minkowski metrics) 

$$ D(X_i, X_j) = \left( \sum_{k=1}^{p} |x_{ki} - x_{kj}|^{p} \right)^{1/p} $$

3. Uniform norm 

$$ D(X_i, X_j) = \sup_{k = 1, 2, \ldots, p} \{ |x_{ki} - x_{kj}| \} $$
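
These three metrics are straightforward to compute. In the Python sketch below the Minkowski exponent is passed as a separate parameter `r`, an assumption made here to keep it distinct from the number of variables; the function names are illustrative.

```python
def city_block(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, r):
    """Minkowski distance with exponent r (r >= 1)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def uniform(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


if __name__ == "__main__":
    u, v = [0.0, 0.0], [3.0, 4.0]
    print(city_block(u, v), minkowski(u, v, 2), uniform(u, v))
```

Setting r = 1 reproduces the city block distance and r = 2 the Euclidean distance, and as r grows the Minkowski distance approaches the uniform norm.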
III. THE DATA SET 



A. ORIGIN 

The present Student Opinion Form (SOF) system was 
started in the summer quarter of 1975 when it replaced the 
Student Instruction Report (SIR) obtained from the Educa- 
tional Testing Service at Princeton. The SOF form has 16 
questions and space for free-form comments from the students. 
The information obtained from the SOF data is used for the 
twofold purpose mentioned in Section I. A of this paper. 

A SOF form (figure 2) should be completed by each stu- 
dent for each course segment he takes for credit. The term 
"course segment" is used because the same course may be 
offered to more than one group of students. To differen- 
tiate between the classes, segment numbers are assigned and 
a separate SOF identification number exists for each segment. 
Different segments of the same course may or may not be 
taught by the same professor. About 20 percent of the forms 
are not returned to administration officials due to lack of 
interest on the part of some students and instructors. 
Students have been informed that the results of the SOF 
data are used to assist in identifying faculty members for 
pay raises and tenure considerations. 

Difficulties with legibility of the completed forms and 
with the OpScan machine have persisted for several quarters. 
The data available for this research have been coded with 
indications where invalid responses occur. Only the valid 
information was considered in this thesis. Mean scores 
were computed for every instructor (every course segment) 
from the valid responses in each of the first 13 SOF items. 
Only the first 13 questions were used because of the high 
percentage of unusable responses in items 14, 15, and 16. 
Each of the responses recorded is an integer from one to 
five, with five being the upper (more desirable) end of the 
scale. These data are therefore considered on an ordinal 
scale. Table 1 categorizes the blocks of data which were 
available for this study. Note the short 3-digit notation 
to be used in this paper, indicating calendar year and 
quarter number. 

QUARTER    CALENDAR YEAR    NUMBER OF RESPONDENTS    3-DIGIT CODE

Summer     1977             2440                     773
Fall       1977             2967                     774
Winter     1978             3056                     781
Spring     1978             2964                     782

Table 1 

The majority of the analysis was performed using only 
quarter 773. Unless otherwise indicated, future references 
to the data set shall imply quarter 773 data. 

B. TRANSFORMATION 



The need for a common covariance structure when using 
the Mahalanobis metric has been emphasized. The transfor- 
mation of quarter 773 data (which attempted to accomplish 
homogeneity of dispersions) is explained in this section. 

The SOF data are 13-dimensional, and the best trans- 
formations would involve separate examination of each of the 
13 variables. Due to the overwhelming complexity of this 
task, only a single transformation was sought. 

In the SOF data the variance is very much a function 
of the mean. In fact, a course with a 5.0 mean vector has 
no variance whatsoever. Similar effects occur on the lower 
end of the scale. A variance-stabilizing transformation 
was sought which would help to relieve the dependence of the 
variance on the mean. Recall the normal distribution has 
independent mean and variance. Other well known distributions, 
such as the Exponential, Geometric, and Poisson, all 
have related mean and variance. The assumption of multivariate 
normality underlies much of standard classical multivariate 
statistical methodology. The effects of departure 
from normality are not clearly understood. Although marginal 
normality does not imply joint normality, the presence of 
many types of non-normality is often reflected in the marginal 
distributions as well. The marginal distributions of the 
SOF data do not indicate any strong departures from normality. 

Previous research by Professor R.R. Read [7] encountered 
the same need for a transformation of the SOF data. The 
following transformation is due to Professor Read's 
findings: 



ln( ... ),   where a = .2     (1) 

The transformation was used on SOF item 12, and Bartlett's 
test substantiated the presence of homogeneity of variances. 
The groups involved here were the course segments, and the 
application was univariate. 

Studies by Professor Glen Lindsay [8] and students in 
his course on Scaling Techniques produced results which 
suggested slight modifications to Professor Read's transfor- 
mation. 



$$ \ln\left( \frac{\cdots}{5 + b - x} \right), \qquad \text{where } a = 2.0, \; b = 0.3 \qquad (2) $$



The same study could be described equally well with a con- 
stant second difference model, or what is the same thing, 
the function 



$$ x^2 + C \qquad (3) $$



The three transformations were considered in the following 
manner. It was felt that the transformation which would 
produce the most nearly homogeneous covariance structure 
would be best. The three functions were applied to quarter 
773 data, and then statistical tests for common covariance
were administered. The test statistic comes from reference 5
and is explained in Appendix A. The results indicated that
of the three, the first log transformation (1) generated
the most nearly common covariance structure. The group
structure whose covariance matrices were compared came from
clusters formed by the MIKCA algorithm (to be discussed in
the next chapter). On the basis of the test results, the 
data were transformed by function (1) , and all subsequent 
references to the data shall imply the transformed data. 

Functions (1) and (2) are shown together on the graph 
in figure 3. The one chosen for use is the lower curve. 

C. PRINCIPAL COMPONENTS 

Recall the breakdown of the cross products matrix into 
the sum of the within and between scatter matrices. When
considering the observations as individual SOFs (and the 
groups as courses) , the cross products matrix will be called 
the Master scatter matrix with decomposition: 

M = S + T 

where S is the within course scatter and T is the between 
course scatter. It is reemphasized that, in this equation, 
the groups are the courses. The breakdown of the master 
scatter matrix may be examined before any clustering of 
course means is performed because the group structure 
(courses are groups) is known. The discussion is enhanced 
by an algebraic description of the matrices involved. Let 



    M = \sum_{i=1}^{N} \sum_{j=1}^{ns_i} (X_{ij} - X_{..})(X_{ij} - X_{..})'

    S = \sum_{i=1}^{N} \sum_{j=1}^{ns_i} (X_{ij} - X_{i.})(X_{ij} - X_{i.})'

    T = \sum_{i=1}^{N} ns_i (X_{i.} - X_{..})(X_{i.} - X_{..})'

where

    X_{ij}  is the j-th SOF response form from the i-th course.
    X_{i.}  is the mean vector of the i-th course.
    X_{..}  is the grand mean.
    ns_i    is the number of students in the i-th course.
    N       is the total number of courses.
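
The decomposition M = S + T can be checked numerically. The sketch below is an illustration only: it uses hypothetical two-item responses from three courses of unequal size, and the helper names are assumptions introduced for the example.

```python
def outer(u, v):
    """Outer product of two vectors as a list of rows."""
    return [[a * b for b in v] for a in u]

def add(A, B):
    """Element-wise sum of two matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def scatter(rows, center):
    """Sum of (row - center)(row - center)' over all rows."""
    p = len(center)
    total = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - c for x, c in zip(row, center)]
        total = add(total, outer(d, d))
    return total

def decompose(courses):
    """Return (M, S, T) for a list of courses, each a list of response vectors."""
    all_forms = [form for course in courses for form in course]
    grand = mean_vector(all_forms)
    p = len(grand)
    M = scatter(all_forms, grand)                  # master scatter
    S = [[0.0] * p for _ in range(p)]              # within-course scatter
    T = [[0.0] * p for _ in range(p)]              # between-course scatter
    for course in courses:
        center = mean_vector(course)
        S = add(S, scatter(course, center))
        d = [c - g for c, g in zip(center, grand)]
        # Each course mean is weighted by its class size ns_i.
        T = add(T, [[len(course) * v for v in row] for row in outer(d, d)])
    return M, S, T

# Hypothetical two-item responses from three courses of unequal size.
courses = [
    [[4.0, 5.0], [5.0, 5.0], [4.0, 4.0]],
    [[2.0, 3.0], [3.0, 3.0]],
    [[3.0, 4.0], [4.0, 4.0], [3.0, 5.0], [4.0, 3.0]],
]
M, S, T = decompose(courses)
max_gap = max(abs(M[i][j] - S[i][j] - T[i][j]) for i in range(2) for j in range(2))
```

The value of max_gap is zero up to rounding error, which is the identity M = S + T.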



Since T represents the dispersion of the course means, it 
is the main object of the clustering efforts. It is natural 
to ask also, how much information is in S. To this end a 
principal components analysis was performed on the covariance 
matrices:

    C_T = \frac{1}{N-1} T,    N - 1 = 189

    C_S = \frac{1}{\sum_{i=1}^{N} (ns_i - 1)} S,    \sum_{i=1}^{N} (ns_i - 1) = 1993

for quarter 773 data.




Anderson [6] describes principal components as the axes of 
a coordinate system with special statistical properties. 

The principal components form a new coordinate system 
resulting from linear transformations of the variables 
which produce the special properties in terms of variances. 
The idea is to describe the data swarm by a new set of 
orthogonal coordinates so that the sample variances with 
respect to the new coordinates are in decreasing order. 

If the eigenvalues of the covariance matrix are ordered,
i.e., \lambda_1 > \lambda_2 > ... > \lambda_p, then the variance in the new
coordinate system is greatest in the dimension associated
with \lambda_1, next greatest in the dimension associated with \lambda_2,
etc. The sum of the eigenvalues is the total variance in
the original coordinate system.

The results of the principal components analysis are 
shown in Table 2. First, it is of interest to compute how 
much of the total energy in M is accounted for by T. 

TOTAL = 18.8(1993) + 156(189) = 66952 



PRINCIPAL COMPONENTS ANALYSIS

              C_T          C_T EIGENVECTOR        C_S          C_S EIGENVECTOR
          EIGENVALUES       FOR \lambda_1      EIGENVALUES      FOR \lambda_1

   1          0.44              0.28              0.35             -0.27
   2        136.97              0.30              0.47             -0.30
   3          4.64              0.28              0.49             -0.28
   4          0.46              0.29              0.51             -0.29
   5          3.66              0.23              0.52             -0.19
   6          2.57              0.19              0.63             -0.19
   7          1.14              0.25              0.63             -0.24
   8          1.63              0.27              0.68             -0.29
   9          1.47              0.32              0.77             -0.33
  10          0.66              0.30              0.89             -0.33
  11          1.23              0.26              1.00             -0.28
  12          0.96              0.35              1.10             -0.30
  13          0.84              0.26             11.00             -0.27

TOTAL       156                                  18.8

TABLE 2

T accounts for 29484 ÷ 66952 = 44 percent of the total.
This indicates that a great deal of variability must therefore
be accounted for within the courses (i.e., with the
students).
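
The arithmetic behind this percentage can be reproduced from the totals in Table 2 and the two divisors quoted earlier. Since the trace of a covariance matrix times its divisor recovers the trace of the corresponding scatter matrix, a short sketch (variable names are illustrative) is:

```python
# Totals read from Table 2 and the divisors quoted in the text.
trace_c_s, dof_s = 18.8, 1993    # within-course covariance C_S
trace_c_t, dof_t = 156.0, 189    # between-course covariance C_T

within = trace_c_s * dof_s       # trace of S
between = trace_c_t * dof_t      # trace of T
total = within + between         # trace of M = S + T
share_between = between / total  # fraction of the total carried by T
```

Here between is 156 x 189 = 29484, and share_between rounds to 0.44.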

The principal components analysis of C_S shows the first
principal component accounts for 55 percent of its total
variance, but all other coordinate directions each account
for 6 percent or less. Moreover, the direction of the first
component is essentially the main diagonal of 13 space,
i.e., the signs are all the same and so are the magnitudes 
(approximately) . Thus the data swarm may be thought of 
as an elongated ellipsoid directed along the main diagonal 
and having spheroidal (more or less) cross section. In 
particular, this suggests that the students within a course 
tend to score all 13 components more or less the same (all 
high, all moderate, or all low) , but perceptions from student 
to student differ. 

Turning to the principal components analysis of C_T, it
is seen that 85 percent of the total variability is accounted 
for by the first principal component, and the second accounts 
for only three percent. Thus the data swarm of course means 
may be viewed as essentially one dimensional. Reference to 
its eigenvector reveals no single SOF item or group of SOF 
items is heavily weighted relative to the others and that the 
signs are again all the same. Thus, this component is 
similarly shaped along the main diagonal of 13 space, but 
more extremely elongated. 
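
A simple idealized model reproduces this geometry. If all 13 items had unit variance and a common correlation rho (an assumption made only for this illustration, not a property established for the SOF data), the first principal component would lie exactly along the main diagonal, with every eigenvector entry equal to 1/sqrt(13), about 0.28, and it would carry the fraction (1 + 12 rho)/13 of the total variance. The sketch below verifies this by power iteration; the value rho = 0.84 is hypothetical and was chosen to give a share near 85 percent.

```python
import math

def equicorrelation(p, rho):
    """p x p matrix with unit variances and common correlation rho."""
    return [[1.0 if i == j else rho for j in range(p)] for i in range(p)]

def leading_component(matrix, iterations=200):
    """Power iteration: returns (largest eigenvalue, unit eigenvector)."""
    p = len(matrix)
    v = [1.0 + 0.1 * i for i in range(p)]        # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, w))   # Rayleigh quotient
    return eigenvalue, v

p, rho = 13, 0.84
cov = equicorrelation(p, rho)
eigenvalue, direction = leading_component(cov)
share = eigenvalue / sum(cov[i][i] for i in range(p))   # fraction of total variance
```

The eigenvector entries in Table 2 scatter around the value 0.28 that this model predicts, which is what "essentially the main diagonal" means in numerical terms.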

Some exploratory work was done on the within class 
variability (S) to see if the "number of quarters completed" 
by students has any effect on the variability represented 
by S. Figure four presents the results with a graph plotting 
within course variance versus time on board. Note the 
tendency for the variability to drop off in later quarters, 
possibly indicating more perfunctory completion of the forms. 
[Figure 4. Within-course variance versus number of quarters completed]



IV. THE MIKCA METHOD



A. THE ALGORITHM 

The specific algorithm chosen for the cluster analysis 
is the MIKCA (Multivariate Iterative K-MEANS Clustering
Algorithm) program written by Douglas J. McRae as a part 
of his doctoral dissertation at the University of North 
Carolina, Chapel Hill. 

Reference to the flow chart in figure 5 will aid the 
reader in the following discussion of the algorithm. Inputs 
to the program are the data matrix, an estimate for g (the 
number of clusters) , and choice of criterion and distance 
functions . 

In the first step, preliminary calculations are made, 
such as the variable means and standard deviations, as well 
as the cross products matrix T. The next step forms the 
initial cluster solution. A random choice of g observations
serves as the initial cluster centers. Then each of the 
other observations is assigned to the nearest cluster. 
Euclidean distance is used for this initial phase, and the 
cluster centroids are recomputed after each observation is 
assigned to a group. The observations are considered in the 
same order as they were input. After all of them have been 
assigned to clusters, the criterion value is computed. This 
initial cluster-finding technique is referred to as a one- 
pass K-MEANS procedure. It is performed three times, and 
the solution which yields the best criterion value is chosen
as the initial cluster solution. 
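
The initial phase described above can be sketched as follows. This is a minimal illustration written for this discussion, not McRae's FORTRAN code: the function names are hypothetical, trace W is assumed as the criterion, and the data are made-up two-item course means.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def one_pass_kmeans(data, g, rng):
    """Seed g clusters with random observations, then assign the rest in
    input order, updating the receiving centroid after every assignment."""
    seeds = rng.sample(range(len(data)), g)
    centroids = [list(data[i]) for i in seeds]
    counts = [1] * g
    labels = [None] * len(data)
    for cluster, i in enumerate(seeds):
        labels[i] = cluster
    for i, obs in enumerate(data):
        if labels[i] is not None:
            continue                               # seed observation, already placed
        nearest = min(range(g), key=lambda c: squared_distance(obs, centroids[c]))
        counts[nearest] += 1
        # Running-mean update of the centroid that received the observation.
        centroids[nearest] = [m + (x - m) / counts[nearest]
                              for m, x in zip(centroids[nearest], obs)]
        labels[i] = nearest
    return labels, centroids

def trace_w(data, labels, g):
    """Trace of the within-cluster scatter matrix (smaller is better)."""
    total = 0.0
    for c in range(g):
        members = [obs for obs, lab in zip(data, labels) if lab == c]
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum(squared_distance(obs, center) for obs in members)
    return total

def initial_solution(data, g, tries=3, seed=0):
    """Keep the best of several one-pass solutions."""
    rng = random.Random(seed)
    candidates = [one_pass_kmeans(data, g, rng)[0] for _ in range(tries)]
    return min(candidates, key=lambda labels: trace_w(data, labels, g))

# Hypothetical course-mean vectors (two items only, for illustration).
data = [[4.8, 4.7], [4.6, 4.9], [3.1, 3.0], [2.9, 3.2], [4.7, 4.6], [3.0, 2.8]]
labels = initial_solution(data, g=2)
```

Because the centroids move after each assignment, presenting the same observations in a different order can produce a different solution.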

After the initial solution has been found, the program 
advances to the iterative K-MEANS phase where the observa- 
tions are again considered in the order in which they were 
input to the program. It is this phase where the user's 
choice of distance function is used. The distance from each 
observation to each cluster centroid is again computed, this 
time with the user's distance function, the assignment to the 
closest centroid being made and the centroid updated to 
reflect its new membership. After considering all n obser- 
vations in this manner, the new criterion value is checked 
for possible improvement during the K-MEANS iteration. As 
long as the criterion value improves, the K-MEANS procedure 
is repeated; if the criterion fails to improve then the MIKCA 
algorithm goes to the next step, the individual switches 
section. 

Note the importance of the order of consideration of the 
observations. The order is important because the cluster 
means are recomputed after each observation is reassigned. 

In the individual switches phase, consideration is given 
to moving each observation to every other cluster, the move 
being made if and only if an improvement in the value of the 
criterion results. An elaborate labelling procedure pro- 
vides a unique order in which to consider each observation. 
This procedure continues until a complete pass through the 
data is made with no changes in cluster membership. 
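
The individual switches phase amounts to a local search over single reassignments. The following sketch assumes the trace W criterion and considers the observations in input order rather than in the order produced by the labelling procedure mentioned above; the guard against emptying a cluster is also an assumption of this illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def trace_w(data, labels, g):
    """Trace of the within-cluster scatter matrix (smaller is better)."""
    total = 0.0
    for c in range(g):
        members = [obs for obs, lab in zip(data, labels) if lab == c]
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum(squared_distance(obs, center) for obs in members)
    return total

def individual_switches(data, labels, g):
    """Move single observations between clusters while the criterion improves."""
    labels = list(labels)
    best = trace_w(data, labels, g)
    changed = True
    while changed:                      # stop after a full pass with no change
        changed = False
        for i in range(len(data)):
            current = labels[i]
            if labels.count(current) == 1:
                continue                # assumption: never empty a cluster
            for candidate in range(g):
                if candidate == current:
                    continue
                labels[i] = candidate
                value = trace_w(data, labels, g)
                if value < best - 1e-12:
                    best, current, changed = value, candidate, True
                else:
                    labels[i] = current  # undo the trial move
    return labels, best

# Hypothetical course means and a deliberately poor starting partition.
data = [[4.8, 4.7], [4.6, 4.9], [3.1, 3.0], [2.9, 3.2], [4.7, 4.6], [3.0, 2.8]]
final_labels, final_value = individual_switches(data, [0, 0, 0, 1, 1, 1], g=2)
```

Starting from a partition that mixes high and low scoring courses, two switches separate them, after which a complete pass produces no further change.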
The MIKCA algorithm provides the following options for
distance and criterion functions.

CRITERION

1. Minimum trace W

2. Minimum det W

3. Maximum largest root of |B - \lambda W| = 0

4. Maximum sum of roots of |B - \lambda W| = 0

DISTANCE

1. Euclidean

2. Weighted Euclidean

3. Mahalanobis
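
The three distance options differ in how they treat scale and correlation. The sketch below is restricted to two variables so that the matrix inverse can be written out explicitly. It assumes that the weighted Euclidean distance divides each squared difference by that variable's variance, which is one common definition and may not be the exact weighting used in MIKCA.

```python
import math

def euclidean(x, c):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))

def weighted_euclidean(x, c, variances):
    """Each squared difference is divided by that variable's variance."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, c, variances)))

def mahalanobis_2d(x, c, cov):
    """Mahalanobis distance for two variables, using the explicit 2 x 2 inverse."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    d1, d2 = x[0] - c[0], x[1] - c[1]
    quad = (s22 * d1 * d1 - (s12 + s21) * d1 * d2 + s11 * d2 * d2) / det
    return math.sqrt(quad)

# Hypothetical centroid and a covariance with strong positive correlation.
center = [3.0, 3.0]
cov = [[1.0, 0.8], [0.8, 1.0]]
along = mahalanobis_2d([4.0, 4.0], center, cov)    # offset along the main diagonal
across = mahalanobis_2d([4.0, 2.0], center, cov)   # offset across the main diagonal
plain = euclidean([4.0, 4.0], center)              # same for both offsets
```

Both offsets have the same Euclidean length, but the Mahalanobis distance is much larger across the direction of correlation than along it.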

Using R.A. Fisher's iris data, McRae tested his algorithm 
and produced extremely good results . Using the det W 
criterion and Mahalanobis distance, MIKCA produced a solu- 
tion identical to the classification given by multiple 
discriminant analysis. This is a notable achievement since 
the cluster procedure, which does not know the true composi- 
tion before the analysis, makes the same final classification 
of observations as does the discriminant procedure, which 
bases its analysis on the group composition information. 

The MIKCA provides as output the value of the criterion 
function, the cluster membership, and the cluster mean vectors. 
Also provided are two matrices, T and W. The program was 
written in FORTRAN IV for the IBM 360 series of computers. 
B. MODIFIED MIKCA



Initially, the MIKCA program was used with the p-component 
mean responses for each course as the input data matrix. 

Since the number of students utilized in producing these
means is quite variable, these input vectors are not equally
well determined and, as has been mentioned earlier, may
affect the covariance structure between the objects. It
is desirable to have the option of weighting these course 
means in order to effect a better balance in terms of their 
accuracy and to reduce any consequential distortion in the 
covariances. It is convenient to refer to this modification 
as the "1 man 1 vote" option, and to the original technique 
as the "1 course 1 vote" option. The following algebraic 
definitions will aid in illustrating the weighting effect. 

Recall the breakdown of the master scatter matrix into 
the sum of within and between matrices.

M = S + T 



When the mean scores are computed for each course and used
as inputs to MIKCA, then a different dispersion structure 
takes form. The groups are no longer the known courses, 
but are now the object of the problem. The groups are unknown 
clusters of courses (or professors). Let T* denote the 
total scatter contained in the data when each observation 
represents a course mean vector. T* may also be expressed 
as the sum of within and between scatter matrices. 
    T* = W* + B*

These matrices are defined as follows:

    T* = \sum_{s=1}^{g} \sum_{k=1}^{nc_s} (x_{sk} - x_{..})(x_{sk} - x_{..})'

    W* = \sum_{s=1}^{g} \sum_{k=1}^{nc_s} (x_{sk} - x_{s.})(x_{sk} - x_{s.})'

    B* = \sum_{s=1}^{g} nc_s (x_{s.} - x_{..})(x_{s.} - x_{..})'

where

    nc_s    is the number of observations (courses) in
            the s-th cluster.

    g       is the number of clusters.

    x_{sk}  is the k-th observation (course mean vector)
            in the s-th cluster.

    x_{s.}  is the mean vector of the s-th cluster,

                x_{s.} = \frac{1}{nc_s} \sum_{k=1}^{nc_s} x_{sk}

    x_{..}  is the grand mean,

                x_{..} = \frac{\sum_{s} \sum_{k} x_{sk}}{\sum_{s} nc_s}

Note that the grand mean mentioned here is not the same as
the grand mean used in the decomposition of the master
matrix M. The difference between T and T* is that T is
weighted by the number of students in each course, ns_i. This
weighting factor was lost when the individual observations 
were viewed as the class mean vectors. A close algebraic 
examination of T will illustrate its weighted property. 
Originally, we had M = S + T where:

    T = \sum_{i=1}^{N} ns_i (X_{i.} - X_{..})(X_{i.} - X_{..})'

It is now helpful to show the decomposition of T.

    T = W + B

Let X_{i.} become x_{sk} (k-th course mean in s-th cluster) and
ns_i become ns_{sk} (number of students in k-th course of s-th
cluster). Therefore the same T can be reexpressed as

    T = \sum_{s=1}^{g} \sum_{k=1}^{nc_s} ns_{sk} (x_{sk} - X_{..})(x_{sk} - X_{..})'

Letting

    W_s = \sum_{k=1}^{nc_s} ns_{sk} (x_{sk} - \bar{x}_s)(x_{sk} - \bar{x}_s)'

where

    \bar{x}_s = \frac{\sum_{k=1}^{nc_s} ns_{sk} x_{sk}}{\sum_{k=1}^{nc_s} ns_{sk}}    (weighted mean vector of s-th cluster)

then

    W = \sum_{s=1}^{g} W_s

and

    T = W + B    (B is obtained by subtraction)



The understanding of this distinction is important because 
it describes the abbreviated (unweighted) dispersion upon 
which MIKCA bases its cluster solution. 

A number of changes were made to the MIKCA computer code 
to allow for a system of weights, ns_i, for the course means.
The modified code extends the capability of MIKCA by making 
this option available. It amounts to using T rather than 
T* as the basic dispersion structure. This seems more natural 
because the matrix T appears in the earlier decomposition. 



M = S + T = S + W + B 
Some of the changes are summarized here:

1. Allow for class size as input.

2. Alter the computation of T to allow for the weighting
factor.

3. Alter the computation of cluster centroids to allow
for weighting.

4. Alter calculations of the B matrix for the same
reason. (W is found by subtracting B from T.)

The computer code for the modified MIKCA is included in 
Appendix F. 
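
The effect of the four changes can be illustrated in a few lines. The sketch below is not the Appendix F code; it is a hypothetical Python illustration in which the weighted=False setting gives the "1 course 1 vote" matrices (T*, W*, B*) and weighted=True gives the "1 man 1 vote" matrices (T, W, B), with W obtained by subtraction as in item 4.

```python
def weighted_mean(vectors, weights):
    total = sum(weights)
    return [sum(w * v[j] for v, w in zip(vectors, weights)) / total
            for j in range(len(vectors[0]))]

def weighted_scatter(vectors, weights, center):
    """Sum over rows of weight * (row - center)(row - center)'."""
    p = len(center)
    out = [[0.0] * p for _ in range(p)]
    for v, w in zip(vectors, weights):
        d = [a - c for a, c in zip(v, center)]
        for i in range(p):
            for j in range(p):
                out[i][j] += w * d[i] * d[j]
    return out

def decompose_t(course_means, class_sizes, labels, g, weighted=True):
    """Return (T, W, B); with weighted=False every course counts once."""
    weights = class_sizes if weighted else [1] * len(course_means)
    grand = weighted_mean(course_means, weights)
    T = weighted_scatter(course_means, weights, grand)
    p = len(grand)
    B = [[0.0] * p for _ in range(p)]
    for s in range(g):
        members = [i for i, lab in enumerate(labels) if lab == s]
        vecs = [course_means[i] for i in members]
        wts = [weights[i] for i in members]
        centroid = weighted_mean(vecs, wts)          # weighted cluster centroid
        Bs = weighted_scatter([centroid], [sum(wts)], grand)
        B = [[B[i][j] + Bs[i][j] for j in range(p)] for i in range(p)]
    W = [[T[i][j] - B[i][j] for j in range(p)] for i in range(p)]  # by subtraction
    return T, W, B

# Hypothetical course means, class sizes, and cluster labels.
course_means = [[4.8, 4.6], [4.4, 4.5], [3.2, 3.0], [3.6, 3.4]]
class_sizes = [4, 10, 30, 20]
labels = [0, 0, 1, 1]
T, W, B = decompose_t(course_means, class_sizes, labels, g=2)
T_star, W_star, B_star = decompose_t(course_means, class_sizes, labels, g=2,
                                     weighted=False)
```

With weighting, the centroid of the first cluster is pulled toward the course with ten students rather than sitting midway between the two course means.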

Cluster solutions using both weighted (T) and unweighted 
(T*) dispersion structures were found and compared (see 
table 3 in next section). The comparison indicates some
differences in cluster solutions; however, the importance
of these differences is left to the reader.

C. RESULTS

Several cluster solutions were formed using the MIKCA 
algorithm. It seemed wise to include the number of students 
in a course as the 14-th variable. The natural logarithm 
of the class size was the transformation applied to this 
variable. Since class sizes ranged from two to 40, this 
transformation brought the values into a similar range as the 
other 13 variables and also reduced skewness. For quarter 
773 the mean class size was 12.7 students with a standard 
deviation of 7.9. For the transformed variable these 
statistics are 2.3 and 0.7. Cluster solutions were found
with and without inclusion of this 14-th variable. The 
results are shown in table three. 
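
The effect of the logarithm on the range of the class-size variable is easy to verify. The short sketch below uses only the extremes quoted above; the variable names are illustrative.

```python
import math

smallest, largest = 2, 40                     # class-size range quoted in the text
log_range = (math.log(smallest), math.log(largest))
raw_ratio = largest / smallest                # 20-fold spread before transforming
log_span = log_range[1] - log_range[0]        # spread after transforming
```

The transformed values run from about 0.69 to about 3.69, a span of roughly three units.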

Another option available to the MIKCA user is the stan- 
dardization of variables prior to entering the clustering 
process. McRae points out how this option becomes very 
useful when the variables are on vastly different scales of 
measurement. Except for the 14-th variable the present 
scales are psychological in nature and seem to be much the 
same. Some exploratory work was performed with the standar- 
dization option (see table three) but it was not considered 
significant because of the similarity in the scales of 
measurement. 

Table three shows the comparisons of cluster results 
obtained under various conditions. The comparison coeffi- 
cient provides a measure of agreement between solutions and 
is computed by a method introduced in Chapter VII. Table 
three shows generally higher values for g = 3 , indicating 
that there exists robustness of solutions for the smaller 
values of g. 

The results of these cluster solutions may also be seen 
in graphical form by referring to Appendix B. These graphs, 
called profile charts, depict the mean vectors for each of 
the clusters formed by the MIKCA algorithm. The mean vectors 
have been standardized so that one can see the number of 
standard deviations from the grand mean. These profiles are
also helpful in identifying the variables which are significant
in the cluster membership. For example, an important
variable would be one that produces a break in the pattern.

[Table 3. Comparison Coefficients for Cluster Solutions
Obtained Under Various Conditions]

In the 13 variable case, the profiles produced results 
which indicated the lack of clearly dominant variables in 
cluster identification. With introduction of the 14-th 
variable, some very revealing results become immediately 
apparent. While the cluster membership changed little in 
going from 13 to 14 variables, the cluster with the highest 
mean vector became clearly associated with the smallest 
class sizes. Similarly the cluster with the lowest mean 
vector is characterized by a very large class size. This 
finding is one of the most significant results. 

One of the most critical decisions facing the analyst 
is the number of clusters to form. Some algorithms based 
on the K-MEANS idea allow g to change during the clustering
process; however, the MIKCA method requires g to be input by
the user, and it does not change in the course of the program
execution. Typically the investigator does not know
the number of clusters in the data, and he must make some
educated guess. As pointed out earlier, it is possible for 
several different, but meaningful, cluster solutions to 
exist in one body of data. 

The method used to determine g was to obtain solutions 
based upon several values for g and then plot the criterion 
values for each of these solutions. An appropriate choice 
for g would be a number beyond which the marginal improvement 
of the criterion becomes insignificant. Figures 6 and 7
are the results of such tests suggesting that six clusters 
represent the major portion of the separating power of the 
algorithm. 

Profile charts of the cluster solutions with g = 6 were 
uninteresting. The middle clusters were all bunched 
together, suggesting that clusters were forced on that part
of the data where perhaps they did not actually exist (i.e.,
sparse data near the boundaries) . Comparison results (table 
3) indicate a much more stable solution when g is reduced 
below six. 
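
The rule used to read a value of g from the criterion plots can be written down explicitly. The sketch below is one possible formalization, not the procedure actually coded for this study: the 10 percent threshold is an arbitrary assumption, and the criterion values are hypothetical numbers chosen only to show the shape of such a curve.

```python
def choose_g(criterion_by_g, min_relative_gain=0.10):
    """Smallest g whose next step improves the criterion by less than
    min_relative_gain of its current value (criterion is minimized)."""
    gs = sorted(criterion_by_g)
    for g, nxt in zip(gs, gs[1:]):
        gain = (criterion_by_g[g] - criterion_by_g[nxt]) / criterion_by_g[g]
        if gain < min_relative_gain:
            return g
    return gs[-1]

# Hypothetical criterion values (e.g. trace W) for g = 2, ..., 8.
criterion = {2: 100.0, 3: 70.0, 4: 52.0, 5: 41.0, 6: 34.0, 7: 32.5, 8: 31.5}
chosen = choose_g(criterion)
```

With these illustrative values the rule stops at six clusters. As the profile charts show, such a rule measures only separating power and says nothing about whether the resulting clusters are interpretable.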
[Figures 6 and 7. Criterion value versus number of clusters]



V. DISCRIMINANT ANALYSIS 



A. THEORY 

As mentioned earlier, discriminant analysis allows an 
analyst to classify new observations based on observations 
which are samples from known groups. Only Fisher's linear 
discriminant function will be used in this study. It also 
provides information about the relative importance of the 
various variables in assigning an observation to a group. 

The linear discriminant function is based on the assumptions 
of multivariate normality and homogeneity of dispersions . 

The ability to identify the dominant variables and the 
dimension reduction offered by the discriminant space were 
both extremely useful aids for analyzing the SOF data. 

These "more important" variables will be earmarked for 
later use in the construction of Chernoff's FACES. Also 
of interest is the plot of data points in discriminant 
space. The interaction of the coefficients in the dis- 
criminant functions will be seen as well as the character- 
ization of the dimensions. 

In order to describe our usage of discriminant analysis, 
let us first suppose there are only two clusters in 13- 
dimensional space. It is desired to project these two
clusters orthogonally onto a line so that the variation 
between the two groups is as large as possible relative to 
the variation within the two groups. Finding the direction 
of projection to accomplish this is part of the purpose of 
discriminant analysis. The solution provides a way of
discriminating between the two clusters by a suitable linear
combination of the 13 variables. The same theory is generalized
to g groups, where Wilks [9] has shown that a
projection to the smaller of g-1 or p dimensions is possible
without loss of information. Recall the earlier discussion
that indicated this smaller number as t, the number of nonzero
eigenvalues of W^{-1}B. The eigenvalues are the variances
in the direction of their associated eigenvectors. One can 
easily determine the proportion of variance attributable 
to each of the dimensions of discriminant space and also 
the SOF items which load most heavily in each dimension. 

One gains insight into the variables from examination of 
the coefficients in the discriminant functions. There is 
one function for each dimension, the standardized coeffi- 
cients of which are used in this analysis. 
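
For the two-cluster case described above, the direction of projection is proportional to W^{-1} applied to the difference of the cluster means. The following sketch works in two variables so that the inverse can be written explicitly; the data and function names are hypothetical, and the coefficients are not standardized.

```python
def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def within_scatter_2d(groups):
    """Pooled within-group scatter matrix W for two variables."""
    w = [[0.0, 0.0], [0.0, 0.0]]
    for rows in groups:
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    w[i][j] += d[i] * d[j]
    return w

def fisher_direction(group_a, group_b):
    """Direction proportional to W^{-1}(mean_a - mean_b)."""
    w = within_scatter_2d([group_a, group_b])
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    d = [ma[0] - mb[0], ma[1] - mb[1]]
    return [(w[1][1] * d[0] - w[0][1] * d[1]) / det,
            (-w[1][0] * d[0] + w[0][0] * d[1]) / det]

def project(rows, direction):
    """Discriminant score of each row along the given direction."""
    return [r[0] * direction[0] + r[1] * direction[1] for r in rows]

# Hypothetical course means for two clusters (two items only).
group_a = [[4.5, 4.0], [4.8, 4.4], [4.2, 4.1], [4.6, 4.6]]
group_b = [[3.0, 3.4], [3.4, 3.1], [2.8, 2.9], [3.3, 3.6]]
direction = fisher_direction(group_a, group_b)
scores_a = project(group_a, direction)
scores_b = project(group_b, direction)
```

In this example every score in the first cluster exceeds every score in the second, so a single cut point on the projected line separates them.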

B. RESULTS

Up to this point, most of the analysis has been performed 
on the 190 courses in quarter 773. A smaller, more manage- 
able data base was needed to continue. Also, it seemed 
wise to prepare to study individual departments . The 
Electrical Engineering Department was chosen for further 
analysis since it is a large department and hence not too 
small for this purpose. Over the four quarter period, there 
were 116 course segments with valid SOF responses. These 116 
courses from the EE department were the data used in the 
discriminant analysis. 
When dealing with four clusters, the dimensionality of
the discriminant space is three, and depending on the size
of the eigenvalues, perhaps fewer dimensions will provide
sufficient discrimination. Table four gives the results of 
performing a discriminant analysis. Figure eight is a graph 
of the two-dimensional discriminant space (the third 
dimension is neglected) . 

The eigenvalues indicate 94 percent of the total variance 
is represented by the first two discriminant functions. 

Figure eight corroborates this fact by depicting easily 
seen separation in two dimensions. Imagine projecting the 
points onto the horizontal axis. Discrimination in the 
first dimension would account for 73.6 percent of the 
variation. Groups one and four would easily be separated, 
but two and three would overlap. 
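
The percentages quoted here follow directly from the eigenvalues in table four, each expressed as a share of their sum. A brief check (variable names are illustrative):

```python
eigenvalues = [5.79, 1.64, 0.44]                  # from Table 4
total = sum(eigenvalues)
percentages = [round(100 * ev / total, 1) for ev in eigenvalues]
first_two = 100 * (eigenvalues[0] + eigenvalues[1]) / total
```

The three shares are 73.6, 20.8, and 5.6 percent, and the first two together account for about 94 percent.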

RESULTS OF DISCRIMINANT ANALYSIS ON THE 116 COURSES
IN EE DEPARTMENT OVER A FOUR QUARTER PERIOD

DISCRIMINANT                        RELATIVE
FUNCTION          EIGENVALUE        PERCENTAGE

     1               5.79              73.6
     2               1.64              20.8
     3               0.44               5.6

STANDARDIZED DISCRIMINANT FUNCTION COEFFICIENTS

ITEM       FUNCTION 1      FUNCTION 2      FUNCTION 3

   1       [illegible]        0.23           -0.22
   2          0.13            0.14           -0.09
   3         -0.05            1.46            0.01
   4         -0.47           -0.36           -0.80
   5         -0.31            0.01           -0.82
   6          0.08            0.36           -0.74
   7          0.15           -0.36            0.12
   8          0.23            1.15           -0.36
   9          0.36           -0.82           -0.47
  10          0.05            0.08           -0.77
  11         -0.20           -1.16            1.18
  12         -0.72           -1.08            1.80
  13         -0.18            0.91            0.92

TABLE 4

Examination of the coefficients will enable one to label
the dimensions by identifying the dominant characteristics
which they measure. The first dimension is along the horizontal
axis and is associated with the first discriminant
function. The magnitude of the coefficients indicates their
relative impact on the dimension. The signs aid in understanding
which variables reinforce one another (matching
signs) and which tend to cancel (opposite signs). In the
first function of table four, SOF item 12 is the most prominent.
This question (see figure 2) asks the student to score
the overall rating of the instructor. It is not surprising
that this question is very important in the discrimination
process. High marks on items four and five tend to reinforce
a high mark on question 12. Those questions are:

(4) Difficult concepts were made understandable. 

(5) I had confidence in the instructor's knowledge 
of the subject. 

Interestingly, a high mark on item nine (instructor made 
the course a worthwhile learning experience) tends to 
diminish the effect of a high score on item 12. The first 
dimension is dominated by question 12 and was labeled the 
"popularity" dimension. 

The second dimension is depicted by the vertical axis
on the graph in figure eight, and measurements along this
dimension are controlled by the second discriminant function.
The separating power in this direction is less than one
third that of the first. Note, however, that the vertical
scale is compressed 25 percent more than the horizontal
scale (1.5 inches vertical = 2.0 inches horizontal). Items
three and eight have strong positive coefficients, whereas
questions 11 and 12 are pulling heavily in the negative
direction. However, the strength of the information is not
great, and deeper interpretation hardly seems worth the
effort.

Only 5.6 percent of the total variance appears in the
third function, and it is therefore considered insignificant.
One might note that item 12 also dominates the third
dimension.



The main purpose here has been to identify variables 
for use in constructing Chernoff FACES. The discriminant 
analysis has served that purpose well, and it has also 
described the character of the dimensions. 



VI. CHERNOFF FACES



A. BACKGROUND

Chernoff's FACES was the second cluster method to be 
applied to the SOF data. The method was used with the 
same purpose in mind, and it was hoped that earlier cluster 
solutions could be reproduced by this method. Additionally, 
there was the possibility of gaining new information about 
the structure within the data. Professor Herman Chernoff 
developed this graphical method for representing multivariate 
data. The now familiar data point in p-space is represented 
by a computer-drawn cartoon of a face whose characteristics 
(features) are determined by the position of the point. 
Features such as nose length and mouth curvature correspond 
to components of the data point. In the case of the SOF 
data, each component of the 13-dimensional vector can be 
made to control one of 20 features, and seven constants can 
be selected for the remaining features. The technique lends 
itself to clustering since the investigator can group 
together those faces which resemble each other. 

Chernoff [10] points out that people spend a great deal 
of their life studying and reacting to faces. The human 
mind subconsciously acts as a high speed computer sometimes 
detecting barely measurable differences and ignoring unim- 
portant differences, even if they are large. Chernoff 
claims that unlike a machine, the mind has the capability 
to disregard non-informative data and search for useful
information. He states that certain major characteristics 
of the faces are instantly observed and easily remembered; 
finer details and correlations become apparent after studying 
the faces. Clustering by sorting the faces is certainly 
easier than staring at a large matrix of data. The method 
has pitfalls and limitations and some of them will be dealt 
with in this thesis. 

After the publication of Chernoff's method [11], quite 
a number of people began experimenting with the technique. 
Lake [12] mentions a few more successful applications of 
Chernoff's method, including:

1. L.A. Bruckner of Los Alamos Scientific Lab of the 
University of California while studying the performance 
of offshore oil companies. 

2. Johns Hopkins University 

a. Developing methods of psychiatric screening. 

b. Monitoring patients in intensive care units. 

c. Monitoring the stock market. 

3. Dr. David L. Huff of the University of Texas in 
developing urban regional indicators that measure the 
quality of life. 

4. Professor P.C.C. Wang and Gerald Lake at the Naval 
Postgraduate School in analyzing Soviet naval penetra- 
tions into the Indian Ocean and the African littoral; 
and Soviet foreign policy in sub-Saharan African 
states.

5. Professor Chernoff in geological and fossil-related
experiments.

The field of computer graphics has experienced tremendous 
growth in recent years due mainly to the state of the art 
in computers and computer display equipment (including both 
video and plotting types) . The adage that "a picture is 
worth a thousand words" has proven to be quite true. Recent
developments include on-line programs that perform statistical
analysis with polygon, bar graphs, arrows, and scatter
diagrams. Three-dimensional data displays have facilitated
the work of engineers and statisticians alike. 

An interesting application of the FACES program is 
Bruckner's study of offshore drilling by oil companies. 

Figure 9 displays some of his results. Two of the features,
nose width and eye separation, are controlled by the varia- 
bles "expected years to production" and "number of leases 
won", respectively. Other features are controlled by a 
variety of variables representing the company's financial 
health and growth potential. 

Reference to figure 10 will help describe how the faces 
are constructed. Table 5 gives the range of the variables 
which control the features and distance parameters of the 
face. 

The data are first converted to the x parameters as
follows. The variable Z is used to control the parameter x_i,
which is allowed to range from a_i to b_i.

[Figure 9. Bruckner's Offshore Hydrocarbon Producers]

[Figure 10. Chernoff Face with Ears]
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FEATURE RANGES AND DESCRIPTION 



This table is taken from reference 10, and the descriptions 
are not complete. For a more detailed, mathematical 

explanation, see Appendix C. 



Range 










(0,1) 


^i 


controls 


h* 


distance from 0 to P 


(0,1) 


^2 


controls 


6* 


angle between OP and horizontal 


(0,1) 


X 3 


controls 


h 


half-height of face 


(0.5,2) 


^4 


is 




eccentricity of upper ellipse 
of face (width/height) 


(0.5,2) 


^5 


is 




eccentricity of lower ellipse 
of face (width/height) 


(0,1) 


^6 


controls 




length of nose 


(0,1) 


X? 


controls 


^m 


position of center of mouth 


(-5,5) 


^8 


controls 




curvature of mouth (radius = h/Xg) 


(0,1) 


X 9 


controls 


^m 


length of mouth 


(0,1) 


^10 


controls 


^e 


height of centers of eyes 


(0,1) 


^11 


controls 


^e 


separation of centers of eyes 


(0,1) 


X 12 


controls 


e 


slant of eyes 


O 

• 

O 

• 

00 


Xi3 


is 




eccentricity of eyes (height/width) 


o 


^14 


controls 


^e 


half-length of eye (L^ also 
depends in part on x^^q and 


(0,1) 


^15 


controls 




position of pupils 


(0,1) 


^16 


controls 




height of eyebrow center relative 
to eye 


(0,1) 


Xi7 


controls 


e**-e 


angle of brow relative to eye 


(0,1) 


X 

!-• 

CO 


controls 




length of brow 


(0,1) 


Xi9 


is 




ear diameter 


(0,1) 


^20 


is 




nose width 



TABLE 5

$$x_i = a_i + (b_i - a_i)\,\frac{Z - m}{M - m}$$

where m and M are the observed minimum and maximum of Z.
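
The conversion above can be sketched in a few lines of Python. This is
only an illustration, not the thesis program: the function name
scale_to_range, the sample scores, and the handling of a constant
variable (mapped to the midpoint of the range) are assumptions made for
this example.

```python
def scale_to_range(z, z_values, a, b):
    """Map z linearly onto [a, b] using the observed min and max of z_values."""
    m, M = min(z_values), max(z_values)
    if M == m:
        # Assumption: a variable with no spread is drawn at mid-range.
        return (a + b) / 2
    return a + (b - a) * ((z - m) / (M - m))


# Hypothetical mean scores on one SOF item, mapped to the
# mouth-curvature range (-5, 5) listed in Table 5.
scores = [2.1, 3.4, 4.0, 4.8]
curvatures = [scale_to_range(z, scores, -5.0, 5.0) for z in scores]
```

The smallest observed score lands on the lower limit and the largest on
the upper limit, so the choice of limits alone decides how extreme the
most unusual faces look.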

Chernoff's technical report [10] presents a very detailed 
description of the geometric relationship of the features 
in the face construction. A few general remarks concerning
the geometric attributes are included here. The boundary of
the face is formed by joining portions of two ellipses, an
upper and a lower. The angle theta (θ) determines where
the ellipses meet and, consequently, the height of the ears.

The nose is a triangle centered at the origin. Both its 
height and width are variable. The curvature of the mouth 
is a portion of a circle, the radius and center of which are 
also variable. The eyes are formed by ellipses whose angle, 
half-length, and eccentricity are all controlled by variables. 

B. FEATURE-VARIABLE RELATIONSHIP 

A frequent question is whether some features are more 
informative than others. Some observers feel that the eyes 
convey the most information; others regard the mouth or the 
shape of the face as the most relevant feature. The results 
of the discriminant analysis identified the most dominant 
variables in the discriminant space. Now these variables 
must be assigned to facial features. 

Chernoff [13] himself conducted an experiment to evaluate 
the effect on classification error of random permutations 
in the assignment of variables to features. He found that
random permutations would change the faces so that a
classifier might increase or decrease his number of errors 
by a factor of about 25 percent. Unfortunately, his experi- 
ment did not evaluate the efficiency of specific features. 
His studies also make no effort to determine whether ability 
to discriminate depends on the dimensionality of the data. 

Considering Chernoff's findings, it would seem that
the assignment of variables to features is of minor
importance. Nevertheless, the use of discriminant analysis
provides a way of detecting which variables are important,
and it seems appropriate to take advantage of this valuable
information when constructing the faces. Moreover, there is
a choice in the features that are selected for use. The
author's choices of the six best features are starred in
table 6. The table gives the complete list of
feature-variable combinations. The results of the
discriminant analysis were relied upon heavily in forming
the variable assignments.

Reference to figure 8 (discriminant space) and
table 6 will aid in the following discussion. In the first
dimension the important SOF items are 12 and 4, which
control the mouth curvature and ear height, respectively.
High scores on these two items move the observation
well toward the negative end of the scale and cause the face
to have a big smile and high ears. Items 12 and 4 have the
same sign (negative), but item 9 is associated with a large
positive coefficient and controls the lower eccentricity of
the face. A low mark on this item would complement high
marks on items 12 and 4 and would be reflected in the lower
face having small eccentricity (narrower).

FEATURE-VARIABLE COMBINATIONS

THREE DIFFERENT TRIALS

                                   CONTROLLED BY
      FEATURE               13 VAR     6 VAR     8 VAR
----  --------------------  ---------  --------  --------
  1   FACE WIDTH            0.5        0.5       0.5
  2   ANGLE θ               4          0.65      4
  3   FACE HEIGHT           0.7        0.7       0.7
  4   UPPER ECCENTRICITY    8          0.95      0.95
 *5   LOWER ECCENTRICITY    9          4         0.6
  6   NOSE LENGTH           10         0.45      9
  7   MOUTH CENTER          0.5        0.3       0.5
 *8   MOUTH CURVATURE       12         12        12
  9   MOUTH LENGTH          13         0.7       0.8
 10   EYE HEIGHT            0.23       0.23      0.23
 11   EYE SEPARATION        1          0.5       0.5
 12   EYE SLANT             11         3         3
 13   EYE ECCENTRICITY      3          0.6       0.6
 14   EYE HALF LENGTH       6          0.5       5
 15   PUPIL POSITION        2          9         13
 16   EYEBROW HEIGHT        0.3        0.3       0.3
 17   EYEBROW ANGLE         5          8         8
 18   EYEBROW LENGTH        0.4        0.4       0.4
 19   EAR DIAMETER          0.3        0.3       0.3
 20   NOSE WIDTH            7          11        11

Integer entries are SOF item numbers.
Decimal entries are the values of fixed features.

TABLE 6

Turning to the vertical axis (second dimension) of
figure 8, which accounts for 21 percent of the total
variance, the dominant variables are 3, 8, and 11, where 11
is negative; 3 and 8 are positive. High scores on items 3
and 8 move the observation upward on the vertical axis and
are reflected as highly eccentric eyes and upper ellipse.
Droopy eyes, reflecting a small value for SOF item 11,
tend to complement and reinforce the higher values for
items 3 and 8. It seems like a good idea to use
the results of the discriminant analysis in this way, but
it is impossible for the viewer to know which variables act
together and which interfere unless he is told beforehand.

A good deal of exploratory work was carried out to 
determine useful ranges for the features. The more the 
features are allowed to vary, the wider the variety of 
faces produced. With large ranges, however, faces formed
from extreme data can become very distorted. At the other
extreme, too little variability in the ranges suppresses
valuable information and hinders the clustering process.

It appears that the best ranges depend on the structure
and amount of variability in the data. Every data set has
its own characteristics, and it is best to tailor the
ranges to each data set. A great portion of the SOF
data is found "close" to the grand mean, but there are a
few significant outliers. In order to provide discriminating
ability among the largest mass of the data, the appearance
of the outliers was further accentuated. The ranges were
set at values which would allow close-in discrimination
while attempting to minimize the departure of
the outliers.

C. CLUSTERING THE FACES 

The next step in this research is to cluster the faces.
This task was performed by six students in the Operations
Research curriculum. The faces, shown in figure 11,
represent the 33 course segments from the Electrical
Engineering department in quarter 781. The judges (students
performing the clustering) were given no information
concerning the feature-variable combination. They were
simply instructed to group the faces in the manner which
best suited them. Fifteen minutes were allowed for the task.
The purpose was to cluster the faces quickly but carefully.
The judges were reminded that each face is different and
were asked to search for the most natural looking clusters.
It was felt that too much time spent on this task could
defeat the purpose of the
faces as a first-pass look at the data. In every case, the
judges acted independently of one another. No clues were 
provided which might have indicated which features were 
more important. 

Figure 11 shows the faces in the clusters which were
formed by the MIKCA algorithm. This cluster structure was
used as a standard against which the judges' results were
compared. Table 7 shows the results of this experiment.
There was considerable agreement among the judges, as
indicated by the comparison coefficients (the pairwise
values lie between .65 and .90). There was also
a good deal of similarity between the clusters formed by
the judges and those formed by the MIKCA algorithm.

Figure 11

CLUSTERS DETERMINED BY MIKCA

Several comments by the judges indicated the difficul- 
ties they encountered. The most prevalent comment was the 
difficulty in deciding which feature to consider the most 
important. One judge considered the mouth first in every 
case while another judge used the slant of the eyes as 
a more important variable. The judges also indicated that 
trying to evaluate simultaneously differences in many 
features was quite difficult. It is interesting to note 
that the judges' results were quite similar despite the 
fact that different criteria were employed as they formed 
the clusters. 

The SOF identification numbers have been altered for
this report. There were two course segments which
erroneously reported the same SOF number (see face 150). As
one looks at the faces with the discriminant space in mind,
it is much easier to form a clustering which is similar to
the MIKCA solution. One would be aware, for example, that
the position of the pupils is critical in that it can
diminish the effect of the smile and impact heavily on the
horizontal dimension. This effect can be seen by referring
to faces 139 and 140; they are included in a group whose
smiles are not as large as their own because the position
of their pupils (positive to the right) has diminished the
impact of the curvature of the mouth.

TABLE 7

COMPARISON COEFFICIENTS FOR ALL PAIRS OF JUDGES

JUDGE      1      2      3      4      5      6
  1       1.0    .69    .73    .82    .73    .78
  2              1.0    .90    .80    .77    .68
  3                     1.0    .65    .81    .67
  4                            1.0    .68    .73
  5                                   1.0    .69
  6                                          1.0

COMPARISON COEFFICIENTS
BETWEEN EACH JUDGE AND MIKCA

JUDGE                      1      2      3      4      5      6
COMPARISON COEFFICIENT    .79    .58    .68    .81    .76    .73

SIMULTANEOUS COMPARISONS OF
MULTIPLE JUDGES

NUMBER OF JUDGES           3      4      5
COMPARISON COEFFICIENT    .52    .44    .38

Another example of the interaction of variables is 
seen by referring to face 152. Most judges would quickly 
include this face with the group of four at the top of 
figure 11. A subtle difference, however, is the ear 
position. Reference to the discriminant function
coefficients will indicate a negative coefficient, which has
offset the slant of the eyes in the second dimension.

Knowledge of the discriminant functions helps alleviate 
the confusion which sets in when attempting to cluster. It 
is especially true in this case where so little difference 
exists between the majority of the faces in the middle groups. 

Difficulty in evaluating all 13 features simultaneously
was a problem. As an alternative to this set of faces, two
other sets were produced, one with only six variable
features and the other with eight. Figure 12 contains
samples from these sets, 12a the six variable set and 12b
the eight variable. Of course, not all of the data are
represented in this manner. Only those variables which
loaded heavily in
the discriminant analysis were used, and the features con- 
trolled by those variables are the ones considered to convey 
the most information. Table 6 gives the complete feature- 
variable combinations used in the construction of all sets 
of faces. The data used in constructing the set of 33 faces
are found in Appendix B.

D. PROBLEMS ENCOUNTERED



The last section addressed difficulties faced by the 
judges because it was impossible for them to be aware of 
the information contained in the discriminant analysis. 

Of course, it could be pointed out that there is little
reason to use this particular MIKCA solution as the
standard, but it does serve as an objective one, and it
was desired to compare the machine results with the human
results. This section addresses problems of a more
mechanical nature. 

Exploratory work with the faces uncovered quite a 
number of relationships between the features. The exis- 
tence of geometric dependencies (not discriminant-type 
effects) between features caused difficulties in clearly 
displaying the variables. Two notable examples are 
mentioned here. 

The length of the mouth is quite dependent upon its
curvature. The projection on the horizontal axis (no
relation to discriminant axis) has half-length

$$a_m = x_9\,(h/|x_8|)$$

where $x_8$ is the mouth curvature. The variables which
control these features are thus automatically forced into
this dependent status regardless of their true relationship.
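
A small numerical sketch may make this forced dependence concrete. It
assumes the half-length formula above together with the cap at the face
half-width described in Appendix C; the function name and the sample
values of h and the half-width are hypothetical.

```python
def mouth_half_length(x9, x8, h, face_half_width):
    """Half-length of the mouth projection: x9 * min(face half-width, h/|x8|)."""
    radius = h / abs(x8)  # radius of the mouth circle; x8 must be nonzero
    return x9 * min(face_half_width, radius)


h, w = 1.0, 0.8  # hypothetical half-height and half-width at mouth level

# The same mouth-length parameter x9 drawn with two different curvatures.
gentle = mouth_half_length(0.5, 1.0, h, w)  # radius 1.0, capped at w
strong = mouth_half_length(0.5, 5.0, h, w)  # radius 0.2
```

With x9 held at 0.5, the drawn half-length falls from about 0.4 to about
0.1 when only the curvature parameter changes, so a viewer cannot read
the mouth-length variable without also accounting for the curvature.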

The other example concerns the ellipses forming the
facial boundary and the angle theta. The upper ellipse is
drawn through the points P', U, and P; the lower through
P', L, and P (see figure 10). Two faces with identical
eccentricities for the ellipses might have quite different
facial boundaries due to the dependence on theta of the
points P' and P. This is another example of forced dependence.

In order for the width and height of the face to meet a 
specified constant, the program "normalizes" both horizontal 
and vertical axes. This normalization eliminates the
effects of $x_1$ and $x_3$, and it adjusts all of the features
during the process. It is believed to be this normalization
process which causes faces which are growing wider and wider 
to suddenly revert to one-half the widest width when the 
width exceeds a threshold value. A similar phenomenon is 
experienced in the height variable. This half-size adjust- 
ment may be seen in figure 12b. Face 132 has been changed 
by a disproportionate amount due to the normalization pro- 
cess. It is of interest to point out that the face-width 
feature was being held constant during the construction of 
that set of faces. 

Yet another hidden dependency is that of the eye height
on the nose length. The eyes are located at height

$$y_e = h\,[x_{10} + (1 - x_{10})\,x_6]$$

where $x_6$ controls the length of the nose.
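
The effect is easy to see numerically. The sketch below evaluates the
eye-height formula for a fixed eye-height parameter and two nose
lengths; the function name and the choice h = 1 are assumptions for
illustration, while 0.23 is the fixed eye-height value listed in
table 6.

```python
def eye_height(x10, x6, h):
    """Height of the eye centers above the origin: h * (x10 + (1 - x10) * x6)."""
    return h * (x10 + (1 - x10) * x6)


# Eye-height parameter held fixed at 0.23 while the nose length varies.
short_nose = eye_height(0.23, 0.2, 1.0)
long_nose = eye_height(0.23, 0.8, 1.0)
```

The eyes move from roughly 0.38 to roughly 0.85 of the half-height even
though the eye-height parameter itself never changes.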

These and other subtle dependencies mislead the
investigator if he is not aware of their existence. These
problems reduce the ability of the faces to display the
full 20 dimensions effectively. Unfortunately, these points
are not explicit in the original document [11], and their
discovery was an eleventh-hour surprise. It was not possible
to adjust for them or to uncover all such relationships at
this writing. Appendix C gives a complete listing of the
formulas used in the construction of the faces.

VII. COMPARISON COEFFICIENT



A. BACKGROUND AND ALTERNATIVES 

Repeated use of the comparison coefficient has been 
made in this study. The present chapter is devoted to an 
explanation of this measure of association. The method 
should be flexible enough to handle multiple comparisons 
simultaneously, thus enabling one to measure the overall 
agreement of several judges. 

It was decided the best way to display the agreement 
of two judges was through the use of a contingency table. 
Table 8 is an example of one to be used in the discussion.

                    Judge X
                 A     B     C
Judge Y    A     5     0     1
           B     1     3     3

                    Table 8

The contingency table indicates the agreement of the two
judges. The purpose of this chapter is to find a measure
which evaluates how close this agreement is. Note that
judge X categorized the observations into three clusters
with 6, 3, and 4 elements, respectively. Of the 13
observations, judge Y placed six in one group and seven in
another. The labeling of the clusters is arbitrary. The
upper left entry in the table indicates that five of the
objects in judge X's category A matched with five of judge
Y's category A. The entire table is interpreted in this
manner. Notice that if one chooses to call this entry of
five as representing agreement, then the entry of 1 below
it and the 1 in the top right corner must represent some of
the observations on which the judges disagreed.

The contingency table idea is easily generalized to
higher dimensions (more than two judges). In three
dimensions, a box (or cube) would represent the table, with
elements internal to the box measuring agreement between
three judges.

One method for measuring the degree of agreement is to
find the largest combination of entries such that only one
per row and one per column are chosen. This task becomes
very difficult as the number of clusters increases, but it
can be solved through the use of linear programming
techniques. It is a constrained optimization with a linear
objective function and is an application of the "assignment
problem." Unfortunately, when generalizing to higher
dimensions, the linear program loses its unimodularity
property and the number of constraints and variables in the
problem becomes prohibitively large.
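
For a two-judge table as small as table 8, the selection of one entry
per row and per column can be checked by brute force. The sketch below
is not the linear programming formulation mentioned above; it simply
enumerates every admissible selection, which is practical only for a
handful of clusters. The function name best_matching is hypothetical.

```python
from itertools import permutations


def best_matching(table):
    """Largest sum of entries using at most one entry per row and per column."""
    n_rows, n_cols = len(table), len(table[0])
    transposed = n_rows > n_cols
    if transposed:
        # Work with the shorter side as rows so every row gets a column.
        table = [list(col) for col in zip(*table)]
        n_rows, n_cols = n_cols, n_rows
    best_total, best_cols = -1, None
    for cols in permutations(range(n_cols), n_rows):
        total = sum(table[r][c] for r, c in enumerate(cols))
        if total > best_total:
            best_total, best_cols = total, cols
    pairs = [(c, r) if transposed else (r, c) for r, c in enumerate(best_cols)]
    return best_total, pairs


# Table 8: rows are judge Y's clusters, columns are judge X's clusters.
table8 = [[5, 0, 1],
          [1, 3, 3]]
total, pairs = best_matching(table8)
```

For table 8 the best selection accounts for 8 of the 13 observations.
The number of selections grows factorially with the number of clusters,
which is one way to see why an optimization method is needed for larger
tables.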

The Chi-square contingency statistic was considered 
inappropriate because, when using the smaller sample sizes, 
more than 20 percent of the cells have expected frequencies
of less than five (see ref. 14). Even when using the
190 element sample, there were frequent occasions when this 
same difficulty persisted. The Chi-square statistic was 
not used since it could not have been applied consistently 
throughout the analysis. 

Professor James Hartman provided an idea that led to 
the method finally put into use. 

B. THE TECHNIQUE

The idea was to sum the squares of the entries in the 
contingency table and then "normalize" this quantity. 
Summing squares offers an excellent method for measuring
the degree of association; however, the following example
illustrates the need for some sort of adjustment factor.

        10     0             19     0
         0    10              0     1

           9a                   9b

                  TABLE 9

Both tables represent perfect agreement on 20 observations;
however, table 9a has a sum of squares equal to 200 and 9b
has a value of 362. It is desired to indicate both of
these examples as perfect agreement with one being no
better than the other. Hence, it became necessary to
determine the "best possible" sum of squares in every given
situation. A computer program was written for this purpose
and is included in Appendix F. The statistic which is
used as a comparison coefficient is a number between zero
and one, formed as the ratio of the actual sum of squares
to the "best possible" sum of squares. The best possible
sum of squares is computed using a minimax approach and
is based on the number of judges, the number of clusters
formed by each judge, and the number of observations within
each cluster. The minimax procedure does not need to know
which observations make up a cluster, only how many
observations.
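
The need for normalization can be reproduced with a few lines of
Python. The sketch computes only the raw sum of squares for the two
tables of table 9; the minimax "best possible" value comes from the
program in Appendix F and is not reproduced here. The function name is
hypothetical.

```python
def sum_of_squares(table):
    """Sum of the squared entries of a contingency table."""
    return sum(n * n for row in table for n in row)


table_9a = [[10, 0], [0, 10]]
table_9b = [[19, 0], [0, 1]]

raw_9a = sum_of_squares(table_9a)
raw_9b = sum_of_squares(table_9b)
```

Both tables show perfect agreement, yet the raw values are 200 and 362.
Since agreement is perfect, each table already attains its own best
possible sum of squares, and the ratio is 1 in both cases.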

An example showing the computation of the comparison 
coefficient is given in Appendix E. 

This method for measuring the degree of agreement 
provides the analyst a standard scale upon which to compare 
coefficients based on solutions involving varying numbers 
of clusters and cluster memberships, as well as varying 
numbers of judges. 

In order to provide some feel for the significance
of this measure, several cluster solutions were
formed wholly at random and compared to results produced
by MIKCA and the judges. In every case, the values of the
comparison coefficients were less than 0.1.

VIII. SUMMARY AND CONCLUSIONS



This research has been largely exploratory. A path 
has been paved for others to follow in examining the SOF 
data. The theory of cluster analysis and its relationship
to discriminant theory have been carefully examined with
emphasis on two widely divergent techniques. In the analysis
of the data, attempts have been made to identify the under- 
lying structure of which the clusters are a consequence. 

This chapter is devoted to separate discussions of the 
cluster methodology explored and the interpretation of 
the SOF data. 

Although the methodology-development phase of the
research was carried to completion in a general sense, a
number of problems were encountered along the way. Many
of these problems are deserving of deeper treatment and 
are discussed below. 

The data transformation was the best of the three con- 
sidered. It produced the smallest test statistic for homo- 
geneity of covariances, but the value itself was not in 
the acceptable range, based on normal theory. It should 
be possible to improve the choice. 

The modification of MIKCA to allow for weighting of
the input vectors has been effected and well tested. It
is an important added capability for this program.

The use of discriminant analysis to discover the
important variables affecting the clustering is, no doubt,
not new. It needs some refinement, however, because it is
not clear how one should rate the importance of variables
supporting the first dimension relative to those supporting
the second (or any other dimension). Such a set of
priorities could be most useful.

The idea of using the important variables (and their 
signs) of the discriminant functions in the problem of 
assigning sets of variables to sets of features is believed 
to be new. It may have great potential in providing a way 
for the Chernoff face technique to replace the more expensive 
technique based on computer iteration. 

The present attempt to work with the faces was disappointing.
This is due largely to the fact that certain restrictions,
truncations, and discontinuities in the movement of the
features were not well documented in our sources. Their
discovery came as a surprise late in the research and it
was not feasible to go back and readjust. Such readjustment
is clearly called for and would require a substantial effort
in future studies.

The coefficient of comparison was a new idea and there
was insufficient time to explore its properties. What is
needed is more investigation in order to interpret its
various values (or another measure whose values are
interpretable). The comparison measure is also useful in the
problem of assigning variables to features when working with
faces. The goal is to choose assignments having the property
that the judges are in good agreement when forming the
clusters.

The study of the SOF data which was made while developing
the cluster methods produced results about student
evaluations of courses (and instructors). The results are
discussed below.

A principal components analysis of the data swarm of
mean vectors showed it to be essentially one dimensional,
with the direction of the main diagonal of 13-space. The
interpretation of this is that all 13 items are equally
important in the students' perception when rating the course
and its instructor. On the other hand, this same effect
would be produced by careless, perfunctory completion of
the forms by many students.

The partitioning of the data into three or four clusters
by MIKCA is more or less successful. The clusters are not
sharply separated (there are no great voids between them).
Study should be made to see how much the density of the
data diminishes near the boundaries of the partitions.

Although the main data swarm is essentially one
dimensional, it appears useful to use two dimensions to
describe the individual partitions after clustering. In
doing this, variable 12 (overall rating of the instructor)
emerged as most important in the first dimension, with
variables 3, 8, and 11 giving support in the second. Only
one discriminant study is reported here, although several
were performed. Variable 12 appears to have permanence while
the other variables often shift in importance.

The cluster profiles which track the cluster centroids 
over the 13 variables provide a set of (almost) horizontal 
lines. This supports further the one dimensional interpre- 
tation of the data swarm. The result is not sensitive to 
whether or not the data are standardized. 

The results of applying the modified MIKCA did not vary
greatly from the results of applying the original MIKCA.
Hence the number and composition of the clusters are not
disturbed much by the variability in class size.

The relative position of the clusters is strongly and 
inversely related to class size. The courses that receive 
uniformly high ratings are associated with the small class 
sizes and the courses receiving uniformly poor ratings are 
associated with the large class sizes. 

All judges reported use of a hierarchical approach to
separate the faces into clusters. Most judges first separated
the faces into two groups according to the curvature of the
mouth (smile or frown). There was little agreement about
which features were important in further subdividing the
two main groups; hence some disparity resulted in their
final cluster solutions.

The MIKCA procedure is a sophisticated approach to cluster
analysis; its results are based on sound statistical theory.
The modified version of that computer program is considered
particularly well suited to the SOF data or any other data
set possessing the same predetermined class structure. The
impact of class size on cluster membership has been
emphasized. This important issue may indicate that the
smaller classes receive artificially inflated SOF scores.
Consideration of this fact surely must be given by those
who use these scores
as a means for evaluating teacher performance.

APPENDIX A

Test for the Equality of Dispersion Matrices of k Groups

Given a sample of k groups and m variables with group
dispersion matrices $S_i$ $(i = 1, \ldots, k)$, pooled
within-groups dispersion matrix $S_w$, and total sample
observations $N = \sum_{g=1}^{k} N_g$, Box shows that the
hypothesis

$$H: \Sigma_1 = \Sigma_2 = \cdots = \Sigma_k$$

may be tested by an F statistic developed from

$$A = (N - k)\,\ln|S_w| - \sum_{i=1}^{k} (N_i - 1)\,\ln|S_i|$$

$$B = \left[\sum_{i=1}^{k} \frac{1}{N_i - 1} - \frac{1}{N - k}\right] \frac{2m^2 + 3m - 1}{6\,(k - 1)(m + 1)}$$

$$C = \left[\sum_{i=1}^{k} \frac{1}{(N_i - 1)^2} - \frac{1}{(N - k)^2}\right] \frac{(m - 1)(m + 2)}{6\,(k - 1)}$$

$$D = \frac{(k - 1)\,m\,(m + 1)}{2}$$

$$E = \frac{D + 2}{|B^2 - C|}$$

If $B^2 > C$, the test statistic is

$$F_{D,E} = \frac{E}{D} \cdot \frac{A\,(1 - B + 2/E)}{E - A\,(1 - B + 2/E)}$$

If $C > B^2$, the test statistic is

$$F_{D,E} = \frac{A}{D}\,(1 - B - D/E)$$
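
As a rough illustration, the quantities A through E can be computed as
in the following Python sketch. It assumes the usual form of Box's F
approximation and that the log-determinants of the dispersion matrices
have already been obtained elsewhere; the function names box_m and
box_f are hypothetical and this is not the program used in the thesis.

```python
def box_m(logdet_pooled, logdets, group_sizes):
    """Quantity A: (N - k) ln|S_w| - sum of (N_i - 1) ln|S_i|."""
    k, N = len(group_sizes), sum(group_sizes)
    return (N - k) * logdet_pooled - sum(
        (n - 1) * ld for n, ld in zip(group_sizes, logdets))


def box_f(A, group_sizes, m):
    """Return (F, D, E) for k groups of sizes N_i measured on m variables."""
    k, N = len(group_sizes), sum(group_sizes)
    B = ((sum(1 / (n - 1) for n in group_sizes) - 1 / (N - k))
         * (2 * m * m + 3 * m - 1) / (6 * (k - 1) * (m + 1)))
    C = ((sum(1 / (n - 1) ** 2 for n in group_sizes) - 1 / (N - k) ** 2)
         * (m - 1) * (m + 2) / (6 * (k - 1)))
    D = (k - 1) * m * (m + 1) / 2
    E = (D + 2) / abs(B * B - C)  # undefined in the rare case B*B == C
    if C > B * B:
        F = (A / D) * (1 - B - D / E)
    else:
        t = A * (1 - B + 2 / E)
        F = (E / D) * t / (E - t)
    return F, D, E


# Hypothetical case: two groups of 11 observations on 2 variables, A = 5.
F, D, E = box_f(5.0, [11, 11], 2)
```

The returned F would be referred to the F distribution with D and E
degrees of freedom.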
APPENDIX C



The following information is taken directly
from Chernoff's technical report [9].

Construction of Faces

Given 18 numbers $(x_1, x_2, \ldots, x_{18})$ in appropriate
ranges (which will usually be 0 to 1), we define a face (see
Fig. 3) as follows. Let $H$ be a nominal distance and let
$h^* = \tfrac{1}{2}(1 + x_1)H$ be the distance from the
origin $O$ to a "corner" point $P$. As $x_1$ varies from 0
to 1, $h^*$ varies from $H/2$ to $H$. Let
$\theta^* = (2x_2 - 1)\pi/4$ be the angle of $OP$ with the
horizontal. Let $P'$ be a point symmetric to $P$ about the
vertical axis through $O$. Let $h = \tfrac{1}{2}(1 + x_3)H$
represent the distance from $O$ to $U$, the top of the head,
and $L$, the bottom of the head, both on the vertical line
through $O$. The upper part of the head is an ellipse which
is determined by $P'$, $U$, and $P$ and an eccentricity
$x_4$. Let $x_4$ represent the ratio of the width to height
of the upper ellipse. Similarly, $x_5$ is the same ratio for
the ellipse through $P'$, $L$, and $P$.

The nose is a vertical line of length $2hx_6$ with $O$ as
center. The mouth intersects the vertical line extended
through the nose at a point $P_m$ whose distance below $O$
is $h[x_7 + (1 - x_7)x_6]$. This represents a point $x_7$
part of the way from the bottom of the nose to $L$. The
mouth is part of a circle whose center is $h/x_8$ above
$P_m$. Thus a positive value of $x_8$ yields a smile. The
mouth is symmetric about the vertical axis through $O$. Its
projection on the horizontal axis has the half-length
$a_m = x_9(h/|x_8|)$ unless $(h/|x_8|)$ exceeds the
half-width $w_m$ of the face at the height of $P_m$. In that
case $x_9 w_m$ is used.

The eyes are located at a height
$y_e = h[x_{10} + (1 - x_{10})x_6]$ above $O$ and at centers
which are $x_e = w_e(1 + 2x_{11})/4$ from the vertical axis,
where $w_e$ is the half-width of the face at the height
$y_e$. They are symmetrically slanted at an angle
$\theta = (2x_{12} - 1)\pi/5$ with the horizontal. The eyes
are ellipses with eccentricity $x_{13}$ (height/length
before slanting) and half-length
$L_e = x_{14}\min(x_e, w_e - x_e)$. The only asymmetry
appears in the location of the pupils, which move together
an amount $r_e(2x_{15} - 1)$ from the center of the eye,
where
$r_e = (\cos^2\theta + \sin^2\theta / x_{13}^2)^{-1/2} L_e$
is the horizontal half-length of the slanted eye at height
$y_e$.

Finally the eyebrows are symmetrically located with
centers at a height $y_b = 2(x_{16} + .3)L_e x_{13}$ above
the eye centers and slant $(2x_{17} - 1)\pi/5$ with respect
to the eye, i.e.,
$\theta^{**} = \theta + (2x_{17} - 1)\pi/5$ with respect to
the horizontal, and half-length $L_b = r_e(2x_{18} + 1)/2$.

One final step taken by the programmer, and which has
been left intact, is to normalize both horizontal and
vertical axes, each by a multiplicative factor, so that
the width of the head at its widest part and its height
are both equal to a specified constant. This step, which
essentially removes two degrees of freedom, was left
unaltered for intuitive and aesthetic reasons that are
somewhat vague and may require reconsideration when dealing
with 18-dimensional data. In the meantime, the effects
of $x_1$ and $x_3$ are almost but not completely eliminated
because of the secondary effects of the normalization,
which will adjust all of the other features at the same
time as the width and height are normalized.

Most of the parameters are adjusted to range within
a subinterval of (0,1). The exceptions are two of the
eccentricities, $x_4$ and $x_5$, and the parameter controlling
curvature of the mouth, $x_8$. Ordinarily, $x_4$ and $x_5$ are
kept within 1/2 to 2, and $x_8$ is kept within (-5,5). The
eccentricity of the eye $x_{13}$ has usually been kept within
(.4,.8). Some of the ranges must be controlled carefully. We
do not want negative length eyes. Others need not be so
carefully controlled. It is no calamity to have eyes extend
beyond the face.

When the two ellipses of the head meet smoothly, the
corner point P is lost, and the variable $x_2$ loses effect.
Restricting $x_4$ and $x_5$ to widely separated ranges seems to
avoid this problem.

Data are converted to the x parameters as follows. If
the variable Z is used to control the parameter $x_i$, which
is to be allowed to range from $a_i$ to $b_i$, we let

$$x_i = a_i + (b_i - a_i)\,\frac{Z - m}{M - m}$$

where m and M are the observed minimum and maximum of Z.
Formulae Used in the Construction



We describe a few of the less trivial formulae used 
in the construction of the faces . 

The point P has coordinates = h* cos 9* and = h* sin 9*. 
The ellipse through PUP' has equation 




(y 




1 



where 



and 



b = h - c , 
u u 



a = X .b 
u 4 u 



X 



u 



f[(h^-yo) 






The ellipse through PLP' has equation 



■ =l’ _ , 

“2 72 " ^ 



where 



= h + , 

* ^5*"l 

and 
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jK-h+y^) 



X 



X5(-h-yo> 



The head is then described by (±x(y) ,y) where 



x(y) = [b^- (y-c^) 



y < y < h 



r,2 , x2,l/2 

X5tb^-(y-Cj_) ) 



-h < y < y 



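The mouth is a circular arc with curvature $|x_8/h|$ 
through $(0, y_m)$ where $y_m = -h\,(x_7 + (1 - x_7)\,x_6)$. It is 
described by 

$$y = y_m + (\operatorname{sgn} x_8)\left[\frac{h}{|x_8|} - \left(\frac{h^2}{x_8^2} - x^2\right)^{1/2}\right], \qquad 0 \le x \le a_m,$$

where 

$$a_m = x_9 \min[\,x(y_m),\ h/|x_8|\,].$$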
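
A rough Python sketch of the mouth construction follows. It assumes the mouth parameters are $x_6$ through $x_9$ as read above, takes the head half-width at mouth height, $x(y_m)$, as an input rather than recomputing it, and draws a straight segment when the curvature parameter is exactly zero. That last case is an assumption of this sketch, since the formula itself divides by $|x_8|$. All names and sample values are hypothetical. 

```python
import math


def mouth_arc(h, x6, x7, x8, x9, half_width_at_mouth, n=21):
    """Return n points (x, y) on the right half of the mouth, 0 <= x <= a_m."""
    y_m = -h * (x7 + (1 - x7) * x6)
    if x8 == 0:
        # Assumed limiting case: zero curvature gives a horizontal segment.
        a_m = x9 * half_width_at_mouth
        return [(a_m * i / (n - 1), y_m) for i in range(n)]
    radius = h / abs(x8)  # radius of a circle with curvature |x8 / h|
    a_m = x9 * min(half_width_at_mouth, radius)
    sign = math.copysign(1.0, x8)
    points = []
    for i in range(n):
        x = a_m * i / (n - 1)
        rise = radius - math.sqrt(max(radius ** 2 - x ** 2, 0.0))
        points.append((x, y_m + sign * rise))
    return points


# Hypothetical values; positive x8 turns the ends of the mouth upward.
for x, y in mouth_arc(1.0, 0.3, 0.5, 2.0, 0.6, 0.9, n=5):
    print(round(x, 4), round(y, 4))
```

The left half of the mouth follows by reflecting each point through the vertical axis. 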

The eyes are nominally centered at $(x_e, y_e)$ where 

$$x_e = x(y_e)\,[1 + 2x_{11}]/4$$

and have half-length 

$$L_e = x_{14} \min[\,x_e,\ x(y_e) - x_e\,].$$

Let (u,v) be the coordinates of an ellipse with center at 
the origin, half-length $L_e$ and eccentricity $x_{13}$. Then 
$v = x_{13}\,(L_e^2 - u^2)^{1/2}$ describes part of the ellipse. A 
similar part of the slanted eye can be described for 
$0 \le u \le L_e$ by 

$$x = x_e + u \cos\theta - v \sin\theta$$

$$y = y_e + u \sin\theta + v \cos\theta$$

and symmetry is used to complete both eyes. 

To place the pupils within the eyes, both are moved 
a distance $r_e(2x_{15} - 1)$ from the center of the eye, where 
$r_e$, the horizontal half-length of the slanted eye at 
height $y_e$, is $(u^2 + v^2)^{1/2}$ when $v/u = \tan\theta$. This yields 

$$r_e = L_e\,(\cos^2\theta + x_{13}^{-2} \sin^2\theta)^{-1/2}.$$
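
The horizontal half-length can be checked against the eye's own ellipse with a short Python sketch. It assumes the ellipse has semi-axes $L_e$ and $x_{13} L_e$ and is rotated by $\theta$ about the eye center, which gives $r_e = L_e(\cos^2\theta + x_{13}^{-2}\sin^2\theta)^{-1/2}$. The function names, the name `pupil_position` for the parameter that moves the pupil, and the sample values are hypothetical. 

```python
import math


def horizontal_half_length(L_e, x13, theta):
    """Half-length r_e of the slanted eye along the horizontal through its center."""
    return L_e / math.sqrt(math.cos(theta) ** 2 + (math.sin(theta) / x13) ** 2)


def pupil_offset(L_e, x13, theta, pupil_position):
    """Signed horizontal shift of the pupil, r_e * (2 * pupil_position - 1).

    pupil_position is assumed to lie in [0, 1], so the shift stays
    within [-r_e, r_e] and the pupil remains inside the eye.
    """
    return horizontal_half_length(L_e, x13, theta) * (2 * pupil_position - 1)


L_e, x13, theta = 0.3, 0.6, 0.4  # hypothetical eye parameters
r_e = horizontal_half_length(L_e, x13, theta)

# Rotate the horizontal offset (r_e, 0) back into the unslanted (u, v) frame
# and confirm that the point lies on the ellipse.
u, v = r_e * math.cos(theta), -r_e * math.sin(theta)
print(r_e, (u / L_e) ** 2 + (v / (x13 * L_e)) ** 2)
print(pupil_offset(L_e, x13, theta, 0.75))
```

When the slant is zero the expression reduces to $r_e = L_e$, the full half-length of the eye. 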

The program then normalizes all heights and widths by the 
multiplicative factors k/h and k/max x(y), respectively. 
Currently k is set at 2 inches. 
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APPENDIX E 



An Example of the Comparison Coefficient 



Given two judges who cluster 20 observations (numbered 
1 through 20) into groups as shown below: 

                 Judge X 

    Cluster 1    1, 2, 3, 4, 7 
    Cluster 2    5, 6, 11, 12, 13 
    Cluster 3    8, 9, 10, 14, 15, 16, 17, 18 
    Cluster 4    19, 20 

                 Judge Y 

    Cluster 1    5, 6, 7 
    Cluster 2    1, 2, 3, 4, 9, 15 
    Cluster 3    8, 11, 12 
    Cluster 4    10, 13, 14, 16, 17, 18, 19, 20 



The contingency table appears below with marginal (row 
and column) totals. 

                         Judge X 
                     1    2    3    4    Total 

    Judge Y    1     1    2    0    0      3 
               2     4    0    2    0      6 
               3     0    2    1    0      3 
               4     0    1    5    2      8 

           Total     5    5    8    2     20 



Step 1: Find the sum of squares of the table entries. 

1+4+16+4+4+1+1+25+4 = 60 
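
The table and the Step 1 total can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name and the list layout are assumptions, and the cluster memberships are those given for the two judges above. 

```python
def contingency_table(row_clusters, col_clusters):
    """Count the observations shared by each pair of clusters."""
    return [[len(set(r) & set(c)) for c in col_clusters] for r in row_clusters]


judge_x = [[1, 2, 3, 4, 7],
           [5, 6, 11, 12, 13],
           [8, 9, 10, 14, 15, 16, 17, 18],
           [19, 20]]
judge_y = [[5, 6, 7],
           [1, 2, 3, 4, 9, 15],
           [8, 11, 12],
           [10, 13, 14, 16, 17, 18, 19, 20]]

# Rows are Judge Y's clusters, columns are Judge X's clusters.
table = contingency_table(judge_y, judge_x)
actual_sum_of_squares = sum(n * n for row in table for n in row)

for row in table:
    print(row)
print(actual_sum_of_squares)  # 60
```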
Step 2: Find the best possible sum of squares. 

Begin with the number of observations in each judge's 
clusters. On each pass, find the maximum for each judge; 
the smaller of the two maxima is the Minimax. Subtract the 
Minimax from the maximum element for each judge and repeat 
until every entry is zero. 

            Judge X          Judge Y 
            # obs in         # obs in 
    Pass    clusters         clusters         Max X   Max Y   Minimax 

     1      5  5  8  2       3  6  3  8         8       8        8 
     2      5  5  0  2       3  6  3  0         5       6        5 
     3      0  5  0  2       3  1  3  0         5       3        3 
     4      0  2  0  2       0  1  3  0         2       3        2 
     5      0  0  0  2       0  1  1  0         2       1        1 
     6      0  0  0  1       0  0  1  0         1       1        1 
     7      0  0  0  0       0  0  0  0 

Finished Step 2 
Step 3: Sum the squares of the Minimaxes. 

64 + 25 + 9 + 4 + 1 + 1 = 104 

Best possible sum of squares = 104 

                             Actual sum of squares           60 
Comparison coefficient  =  ------------------------------ = ----- 
                            Best possible sum of squares     104 

or approximately 0.58. 
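
Steps 2 and 3 can be expressed as a short Python sketch. It follows the Minimax procedure as worked in this example; the function name is hypothetical, and breaking ties by taking the first maximum is an assumption that does not affect the result here. 

```python
def best_possible_sum_of_squares(sizes_x, sizes_y):
    """Repeatedly remove the smaller of the two column maxima (the Minimax).

    Returns the sum of the squared Minimaxes and the list of Minimaxes.
    Both judges must have clustered the same number of observations.
    """
    x, y = list(sizes_x), list(sizes_y)
    if sum(x) != sum(y):
        raise ValueError("both judges must cluster the same observations")
    minimaxes = []
    while any(x):
        minimax = min(max(x), max(y))
        x[x.index(max(x))] -= minimax  # subtract from the first maximum of X
        y[y.index(max(y))] -= minimax  # subtract from the first maximum of Y
        minimaxes.append(minimax)
    return sum(m * m for m in minimaxes), minimaxes


best, minimaxes = best_possible_sum_of_squares([5, 5, 8, 2], [3, 6, 3, 8])
print(minimaxes)  # [8, 5, 3, 2, 1, 1]
print(best)       # 104

actual = 60  # sum of squares of the contingency table entries from Step 1
print(actual / best)
```

A coefficient of 1 would mean that the two judges produced identical partitions, since the contingency table would then hold one nonzero entry per row and column. 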
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